How effective is the current test-driven accountability movement? To a
remarkable extent, the only evidence of success offered by proponents is a rise
in scores on the very tests that are being used to mandate change. These
results, however, are said to be meaningful as long as the tests in question are
good-quality, criterion-referenced exams, the Massachusetts Comprehensive
Assessment System (MCAS) exam being a commonly cited example.
My recent research has uncovered two facts about the MCAS that call these
claims into question and raise concerns that ought to reverberate across the
nation. First, a jump in a school's average score from one year to the next is
unlikely to continue and therefore probably does not signal real improvement.
Second, the MCAS is actually designed to produce a certain range of scores—in
effect, artificially limiting how well students can do.
If a school reports higher scores this year than last, that would of
course be cause for celebration, provided we had reason to believe that the test
was a good measure of the kind of learning regarded as important. But doubts
about the value or validity of the exam, or concerns about what had to be
sacrificed from the curriculum to boost scores, would raise questions from the
outset about whether such a result was indeed good news.
Even putting aside those reservations, however, consider what happens to
schools that proudly report better test results. In a high-profile ceremony at
the Massachusetts Statehouse in December 1999, five school principals were
presented with gifts of $10,000 each for "helping their students make
significant gains on the MCAS." Four of the five led elementary schools, all of
which had reported remarkable increases in average 4th grade math scores from
1998 to 1999. Three of those four schools showed declines the following year.
Was this a fluke? When we look at all the Massachusetts elementary
schools that showed a gain of at least 10 points from 1999 to 2000, we see that
most showed declines in 2001—declines often as large as the gains posted during
the previous year. In fact, a comparison of the changes in 4th grade scores for
all schools (1998 to 1999 vs. 1999 to 2000) finds a statistically
significant negative relationship between the two time periods. A school that
did better the first time was more likely than not to do worse the second time,
and vice versa.
These results don't mean that teachers or students became lazy and tried to
coast on their success. They mean that there was never really evidence of
success at all. Particularly in small schools, as other research has confirmed,
changes in score averages from year to year are poor measures of school quality.
("Republicans Reject Programs on Facilities, Class Size," May 23, 2001.) If fewer than 100
students are tested in each grade, averages may swing widely from year to year
simply because of the particular samples of students tested and the vagaries of
annual test content and administration.
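
The sampling argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reanalysis of the MCAS data: every number in it (the score scale, the spread among students, the 60 students per grade, the number of schools) is hypothetical, and the function names are invented for illustration. Each simulated school has exactly the same true quality in all three years, so any change in its average comes purely from which students happened to be tested.

```python
import random
import statistics


def simulate_changes(n_schools=2000, n_students=60, true_mean=240.0,
                     student_sd=15.0, seed=0):
    """Return two lists of year-to-year changes in school averages.

    All parameter values are hypothetical. Every school has the same,
    unchanging true mean; only the sample of students differs by year.
    """
    rng = random.Random(seed)
    first_change, second_change = [], []
    for _ in range(n_schools):
        # Average score of a fresh sample of students in each of three years.
        yearly = [
            statistics.fmean(
                rng.gauss(true_mean, student_sd) for _ in range(n_students)
            )
            for _ in range(3)
        ]
        first_change.append(yearly[1] - yearly[0])
        second_change.append(yearly[2] - yearly[1])
    return first_change, second_change


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


first_change, second_change = simulate_changes()
r = pearson(first_change, second_change)

# Among schools that "gained" in the first period, how many then declined?
gainers = [(a, b) for a, b in zip(first_change, second_change) if a > 0]
share_declined = sum(1 for _, b in gainers if b < 0) / len(gainers)

print(f"correlation of successive changes: {r:.2f}")
print(f"share of first-period gainers that then declined: {share_declined:.2f}")
```

Under these assumptions, the correlation between the two sets of changes is negative (in theory about -0.5 when yearly averages are pure independent noise), and most schools that gained in the first period decline in the second, even though no school actually got better or worse.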
The other major finding from my research is even more unsettling, provided
one understands the difference between two kinds of standardized tests. Some
tests, those that are called criterion-referenced, measure students against an
absolute standard: how much they know and are able to do. In theory, all
students taking the test might score very high or very low.
Other tests, including the SAT, the Iowa Test of Basic Skills, and the
Stanford Achievement Test, are called norm-referenced, which means they are
concerned with ranking students (or schools) against one another. The results
are reported in relative terms. To learn that a child scored in the 88th
percentile, for example, tells you nothing about how proficient she was, only
what proportion of the population she bested. Half of those who take such tests
will always score below the median.
What's more, the questions on norm-referenced tests are selected not for
their importance (that is, whether they reflect knowledge students should have),
but for their effectiveness in spreading out the scores. Questions that most
students answer correctly will be dropped from these exams and replaced with
those that only about half the students get right.
The MCAS, like other state tests, is widely assumed, even among its critics, to
be a criterion-referenced test. Remarkably, an examination of its technical
manuals reveals that this is not so. Questions for the MCAS are selected and
rejected on the basis of their usefulness in discriminating among test-takers.
For example, pilot test questions answered correctly by a large proportion of
students in 1998 were mostly gone from the operational version of the MCAS in
1999.
This is not just a matter of interest to statisticians. As the author Alfie
Kohn has pointed out, the question driving norm-referenced tests is not "How
well are our students learning?" but "Who's beating whom?" Moreover, when
questions answered correctly by more than 70 percent of students are
systematically excluded from the exam, this guarantees continuing failure. Tests
like the MCAS are designed so that it is impossible for all students to succeed. Evidence
suggests that other state tests (in Texas, California, and New York, for
example) also have been constructed using norm-referenced test-construction
procedures.
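
A short sketch makes the arithmetic of that exclusion rule concrete. The 70 percent cutoff comes from the paragraph above; the pool of pilot questions and the function names are hypothetical, and this is not a reproduction of the actual MCAS item-selection procedure. Each number is the proportion of students who answered a given pilot question correctly.

```python
def filter_items(p_values, max_p=0.70):
    """Keep only questions answered correctly by at most max_p of students."""
    return [p for p in p_values if p <= max_p]


def expected_percent_correct(p_values):
    """Average percent-correct score implied by the questions' proportions."""
    return 100.0 * sum(p_values) / len(p_values)


# Hypothetical pilot pool: proportion of students answering each question correctly.
pilot_pool = [0.95, 0.88, 0.81, 0.74, 0.70, 0.62, 0.55, 0.48, 0.40, 0.33]

operational = filter_items(pilot_pool)

print(f"pilot questions kept: {len(operational)} of {len(pilot_pool)}")
print(f"average percent correct, pilot pool: {expected_percent_correct(pilot_pool):.1f}")
print(f"average percent correct, after exclusion: {expected_percent_correct(operational):.1f}")
```

Because every surviving question is answered correctly by at most 70 percent of students, the average percent-correct score on the resulting test cannot exceed 70, whatever the students in the pilot actually knew.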
The lesson from this investigation, which just happened to focus on
Massachusetts, is universal: Before newspapers report standardized-test results,
before educators concentrate on trying to raise scores, before politicians allow
these scores to determine the fate of students and schools, and before parents
permit their children to be tested, it ought to be clear just how little a gain
in average scores really means—and what the test was really designed to do.