Topic: another critique of standardized testing to ignore
Replies: 11   Last Post: Aug 22, 2000 9:22 PM

Michael Paul Goldenberg

Posted: Aug 15, 2000 10:56 PM

Using Standards and Assessment

Why Standardized Tests Don't Measure Educational Quality

W. James Popham

Educators are experiencing almost relentless pressure to show their
effectiveness. Unfortunately, the chief indicator by which most communities
judge a school staff's success is student performance on standardized
achievement tests.

These days, if a school's standardized test scores are high, people think
the school's staff is effective. If a school's standardized test scores are
low, they see the school's staff as ineffective. In either case, because
educational quality is being measured by the wrong yardstick, those
evaluations are apt to be in error.

One of the chief reasons that students' standardized test scores continue to
be the most important factor in evaluating a school is deceptively simple.
Most educators do not really understand why a standardized test provides a
misleading estimate of a school staff's effectiveness. They should.

What's in a Name?

A standardized test is any examination that's administered and scored in a
predetermined, standard manner. There are two major kinds of standardized
tests: aptitude tests and achievement tests.

Standardized aptitude tests predict how well students are likely to perform
in some subsequent educational setting. The most common examples are the
SAT-I and the ACT, both of which attempt to forecast how well high school
students will perform in college.

But standardized achievement-test scores are what citizens and school board
members rely on when they evaluate a school's effectiveness. Nationally,
five such tests are in use: California Achievement Tests, Comprehensive
Tests of Basic Skills, Iowa Tests of Basic Skills, Metropolitan Achievement
Tests, and Stanford Achievement Tests.

A Standardized Test's Assessment Mission

The folks who create standardized achievement tests are terrifically
talented. What they are trying to do is to create assessment tools that
permit someone to make a valid inference about the knowledge and/or skills
that a given student possesses in a particular content area. More precisely,
that inference is to be norm-referenced so that a student's relative
knowledge and/or skills can be compared with those possessed by a national
sample of students of the same age or grade level.

Such relative inferences about a student's status with respect to the
mastery of knowledge and/or skills in a particular subject area can be quite
informative to parents and educators. For example, think about the parents
who discover that their 4th grade child is performing really well in
language arts (94th percentile) and mathematics (89th percentile), but
rather poorly in science (39th percentile) and social studies (26th
percentile). Such information, because it illuminates a child's strengths
and weaknesses, can be helpful not only in dealing with their child's
teacher, but also in determining at-home assistance. Similarly, if teachers
know how their students compare with other students nationwide, they can use
this information to devise appropriate classroom instruction.
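
A norm-referenced percentile like those in the example above can be sketched in a few lines of Python. This is only an illustration: the function name, the convention of counting ties as half, and the ten-score norm sample are assumptions made up for the sketch, not any test publisher's actual procedure or data.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting ties as half."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)            # scores strictly lower
    ties = bisect_right(ordered, score) - below    # scores exactly equal
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical norm sample of raw scores on a 40-item test (made-up numbers).
norm_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]

print(percentile_rank(31, norm_sample))  # 8 of the 10 norm scores are lower
print(percentile_rank(22, norm_sample))  # 4 lower, 1 tied
```

The point of the sketch is that the result says nothing about how much of the content a student has mastered in absolute terms; it only locates the student relative to the norm group.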

But there's an enormous amount of knowledge and/or skills that children at
any grade level are likely to know. The substantial size of the content
domain that a standardized achievement test is supposed to represent poses
genuine difficulties for the developers of such tests. If a test actually
covered all the knowledge and skills in the domain, it would be far too long.

So standardized achievement tests often need to accomplish their measurement
mission with a much smaller collection of test items than might otherwise be
employed if testing time were not an issue. The way out of this assessment
bind is for standardized achievement tests to sample the knowledge and/or
skills in the content domain. Frequently, such tests try to do their
assessment job with only 40 to 50 items in a subject field--sometimes fewer.

Accurate Differentiation As a Deity

The task for those developing standardized achievement tests is to create an
assessment instrument that, with a handful of items, yields valid
norm-referenced interpretations of a student's status regarding a
substantial chunk of content. Items that do the best job of discriminating
among students are those answered correctly by roughly half the students.
Developers avoid items that are answered correctly by too many or by too few
students.

As a consequence of carefully sampling content and concentrating on items
that discriminate optimally among students, these test creators have
produced assessment tools that do a great job of providing relative
comparisons of a student's content mastery with that of students nationwide.
Assuming that the national norm group is genuinely representative of the
nation at large, then educators and parents can make useful inferences about
students.

One of the most useful of those inferences typically deals with students'
relative strengths and weaknesses across subject areas, such as when parents
find that their daughter sparkles in mathematics but sinks in science. It's
also possible to identify students' relative strengths and weaknesses within
a given subject area if there are enough test items to do so. For instance,
if a 45-item standardized test in mathematics allocates 15 items to basic
computation, 15 items to geometry, and 15 items to algebra, it might be
possible to get a rough idea of a student's relative strengths and
weaknesses in those three realms of mathematics. More often than not,
however, these tests contain too few items to allow meaningful
within-subject comparisons of students' strengths and weaknesses.
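
Why 15 items give only a rough idea can be sketched with a simple binomial model. The model is an assumption introduced here, not something the article specifies: it treats each item as an independent right-or-wrong trial of equal difficulty, and the 60 percent "true" mastery level is a made-up value.

```python
import math


def proportion_standard_error(p, n_items):
    """Standard error of a percent-correct score under a simple binomial model."""
    return math.sqrt(p * (1.0 - p) / n_items)


# Hypothetical student who would answer 60% of such items correctly in the long run.
for n_items in (15, 45):
    se = proportion_standard_error(0.6, n_items)
    print(f"{n_items} items: about +/- {100 * se:.1f} percentage points")
```

Under these assumptions a 15-item subscore is uncertain by roughly 13 percentage points, against roughly 7 for the full 45 items, so modest differences among three 15-item subscores may be nothing more than noise.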

A second kind of useful inference that can be based on standardized
achievement tests involves a student's growth over time in different subject
areas. For example, let's say that a child is given a standardized
achievement test every third year. We see that the child's percentile
performances in most subjects are relatively similar at each testing, but
that the child's percentiles in mathematics appear to drop dramatically at
each subsequent testing. That's useful information.

Unfortunately, both parents and educators often ascribe far too much
precision and accuracy to students' scores on standardized achievement
tests. Several factors might cause scores to flop about. Merely because
these test scores are reported in numbers (sometimes even with decimals!)
should not incline anyone to attribute unwarranted precision to them.
Standardized achievement test scores should be regarded as rough
approximations of a student's status with respect to the content domain
represented by the test.

To sum up, standardized achievement tests do a wonderful job of supplying
the evidence needed to make norm-referenced interpretations of students'
knowledge and/or skills in relationship to those of students nationally. The
educational usefulness of those interpretations is considerable. Given the
size of the content domains to be represented and the limited number of
items that the test developers have at their disposal, standardized
achievement tests are really quite remarkable. They do what they are
supposed to do.

But standardized achievement tests should not be used to evaluate the
quality of education. That's not what they are supposed to do.

Measuring Temperature with a Tablespoon

For several important reasons, standardized achievement tests should not be
used to judge the quality of education. The overarching reason that
students' scores on these tests do not provide an accurate index of
educational effectiveness is that any inference about educational quality
made on the basis of students' standardized achievement test performances is
apt to be invalid.

Employing standardized achievement tests to ascertain educational quality is
like measuring temperature with a tablespoon. Tablespoons have a different
measurement mission than indicating how hot or cold something is.
Standardized achievement tests have a different measurement mission than
indicating how good or bad a school is. Standardized achievement tests
should be used to make the comparative interpretations that they were
intended to provide. They should not be used to judge educational quality.
Let's look at three significant reasons that it is thoroughly invalid to
base inferences about the caliber of education on standardized achievement
test scores.

Testing-Teaching Mismatches

The companies that create and sell standardized achievement tests are all
owned by large corporations. Like all for-profit businesses, these
corporations attempt to produce revenue for their shareholders.

Recognizing the substantial pressure to sell standardized achievement tests,
those who market such tests encounter a difficult dilemma that arises from
the considerable curricular diversity in the United States. Because
different states often choose somewhat different educational objectives (or,
to be fashionable, different content standards), the need exists to build
standardized achievement tests that are properly aligned with educators'
meaningfully different curricular preferences. The problem is exacerbated
further in states where different counties or school districts can
exercise more localized curricular decision making.

At a very general level, the goals that educators pursue in different
settings are reasonably similar. For instance, you can be sure that all
schools will give attention to language arts, mathematics, and so on. But
that's at a general level. At the level where it really makes a difference
to instruction--in the classroom--there are significant differences in the
educational objectives being sought. And that presents a problem to those
who must sell standardized achievement tests.

In view of the nation's substantial curricular diversity, test developers
are obliged to create a series of one-size-fits-all assessments. But, as
most of us know from attempting to wear one-size-fits-all garments,
sometimes one size really can't fit all.

The designers of these tests do the best job they can in selecting test
items that are likely to measure all of a content area's knowledge and
skills that the nation's educators regard as important. But the test
developers can't really pull it off. Thus, standardized achievement tests
will always contain many items that are not aligned with what's emphasized
instructionally in a particular setting.

To illustrate the seriousness of the mismatch that can occur between what's
taught locally and what's tested through standardized achievement tests,
educators ought to know about an important study at Michigan State
University reported in 1983 by Freeman and his colleagues. These researchers
selected five nationally standardized achievement tests in mathematics and
studied their content for grades 4-6. Then, operating on the very reasonable
assumption that what goes on instructionally in classrooms is often
influenced by what's contained in the textbooks that children use, they also
studied four widely used textbooks for grades 4-6.

Employing rigorous review procedures, the researchers identified the items
in the standardized achievement test that had not received meaningful
instructional attention in the textbooks. They concluded that between 50 and
80 percent of what was measured on the tests was not suitably addressed in
the textbooks. As the Michigan State researchers put it, "The proportion of
topics presented on a standardized test that received more than cursory
treatment in each textbook was never higher than 50 percent" (p. 509).

Well, if the content of standardized tests is not satisfactorily addressed
in widely used textbooks, isn't it likely that in a particular educational
setting, topics will be covered on the test that aren't addressed
instructionally in that setting? Unfortunately, because most educators are
not genuinely familiar with the ingredients of standardized achievement
tests, they often assume that if a standardized achievement test asserts
that it is assessing "children's reading comprehension capabilities," then
it's likely that the test meshes with the way reading is being taught
locally. More often than not, the assumed match between what's tested and
what's taught is not warranted.

If you spend much time with the descriptive materials presented in the
manuals accompanying standardized achievement tests, you'll find that the
descriptors for what's tested are often fairly general. Those descriptors
need to be general to make the tests acceptable to a nation of educators
whose curricular preferences vary. But such general descriptions of what's
tested often permit assumptions of teaching-testing alignments that are way
off the mark. And such mismatches, recognized or not, will often lead to
spurious conclusions about the effectiveness of education in a given setting
if students' scores on standardized achievement tests are used as the
indicator of educational effectiveness. And that's the first reason that
standardized achievement tests should not be used to determine the
effectiveness of a state, a district, a school, or a teacher. There's almost
certain to be a significant mismatch between what's taught and what's
tested.

A Psychometric Tendency to Eliminate Important Test Items

A second reason that standardized achievement tests should not be used to
evaluate educational quality arises directly from the requirement that these
tests permit meaningful comparisons among students from only a small
collection of items.

A test item that does the best job in spreading out students' total-test
scores is a test item that's answered correctly by about half the students.
Items that are answered correctly by 40 to 60 percent of the students do a
solid job in spreading out the total scores of test-takers.

Items that are answered correctly by very large numbers of students, in
contrast, do not make a suitable contribution to spreading out students'
test scores. A test item answered correctly by 90 percent of the test-takers
is, from the perspective of a test's efficiency in providing comparative
interpretations, being answered correctly by too many students.
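
The arithmetic behind this preference for middle-difficulty items can be sketched briefly. For an item scored right or wrong, the variance of the item score is p(1 - p), where p is the proportion of students answering correctly. The function name below is made up for the illustration, and the sketch ignores how items correlate with one another, which also affects the spread of total scores.

```python
def item_variance(p_correct):
    """Variance of a right/wrong (1/0) item answered correctly by proportion p."""
    return p_correct * (1.0 - p_correct)


for p in (0.5, 0.6, 0.8, 0.9):
    print(f"p = {p:.1f}: variance = {item_variance(p):.2f}")
```

The variance peaks at p = 0.5 and barely drops at 0.6, but an item that 90 percent of students answer correctly contributes only about a third as much spread as one answered correctly by half of them.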

Test items answered correctly by 80 percent or more of the test takers,
therefore, usually don't make it past the final cut when a standardized
achievement test is first developed, and such items will most likely be
jettisoned when the test is revised. As a result, the vast majority of the
items on standardized achievement tests are "middle difficulty" items.

As a consequence of the quest for score variance in a standardized
achievement test, items on which students perform well are often excluded.
However, items on which students perform well often cover the content that,
because of its importance, teachers stress. Thus, the better the job that
teachers do in teaching important knowledge and/or skills, the less likely
it is that there will be items on a standardized achievement test measuring
such knowledge and/or skills. To evaluate teachers' instructional
effectiveness by using assessment tools that deliberately avoid important
content is fundamentally foolish.

Confounded Causation

The third reason that students' performances on these tests should not be
used to evaluate educational quality is the most compelling. Because student
performances on standardized achievement tests are heavily influenced by
three causative factors, only one of which is linked to instructional
quality, asserting that low or high test scores are caused by the quality of
instruction is illogical.

To understand this confounded-causation problem clearly, let's look at the
kinds of test items that appear on standardized achievement tests. Remember,
students' test scores are based on how well students do on the test's items.
To get a really solid idea of what's in standardized tests, you need to grub
around with the items themselves.

The three illustrative items presented here are mildly massaged versions of
actual test items in current standardized achievement tests. I've modified
the items' content slightly, without altering the essence of what the items
are trying to measure.

The problem of confounded causation involves three factors that contribute
to students' scores on standardized achievement tests: (1) what's taught in
school, (2) a student's native intellectual ability, and (3) a student's
out-of-school learning.

What's taught in school. Some of the items in standardized achievement tests
measure the knowledge or skills that students learn in school. In certain
subject areas, such as mathematics, children learn in school most of what
they know about a subject. Few parents spend much time teaching their
children about the intricacies of algebra or how to prove a theorem.

So, if you look over the items in any standardized achievement test, you'll
find a fair number similar to the mathematics item presented in Figure 1,
which is a mildly modified version of an item appearing in a standardized
achievement test intended for 3rd grade children.

This mathematics item would help teachers arrive at a valid inference about
3rd graders' abilities to choose number sentences that coincide with verbal
representations of subtraction problems. Or, along with other similar items
dealing with addition, multiplication, and division, this item would
contribute to a valid inference about a student's ability to choose
appropriate number sentences for a variety of basic computation problems
presented in verbal form.

If the items in standardized achievement tests measured only what actually
had been taught in school, I wouldn't be so negative about using these tests
to determine educational quality. As you'll soon see, however, other kinds
of items are hiding in standardized achievement tests.

A student's native intellectual ability. I wish I believed that all children
were born with identical intellectual abilities, but I don't. Some kids were
luckier at gene-pool time. Some children, from birth, will find it easier to
mess around with mathematics than will others. Some kids, from birth, will
have an easier time with verbal matters than will others. If children came
into the world having inherited identical intellectual abilities, teachers'
pedagogical problems would be far more simple.

Recent thinking among many leading educators suggests that there are various
forms of intelligence, not just one (Gardner, 1994). A child who is born
with less aptitude for dealing with quantitative or verbal tasks, therefore,
might possess greater "interpersonal" or "intrapersonal" intelligence, but
these latter abilities are not tested by these tests. For the kinds of items
that are most commonly found on standardized achievement tests, children
differ in their innate abilities to respond correctly. And some items on
standardized achievement tests are aimed directly at measuring such
intellectual ability.

Consider, for example, the item in Figure 2. This item attempts to measure a
child's ability "to figure out" what the right answer is. I don't think that
the item measures what's taught in school. The item measures what students
come to school with, not what they learn there.

In Figure 2's social studies item for 6th graders, look carefully at the
four answer options. Read each option and see if it might be correct. A
"smart" student, I contend, can figure out that choices A, B, and D really
would not "conserve resources" all that well; hence choice C is the winning
option. Brighter kids will have a better time with this item than their less
bright classmates.

But why, you might be thinking, do developers of standardized tests include
such items on their tests? The answer is all too simple. These sorts of
items, because they tap innate intellectual skills that are not readily
modifiable in school, do a wonderful job in spreading out test-takers'
scores. The quest for score variance, coupled with the limitation of having
few items to use in assessing students, makes such items appealing to those
who construct standardized achievement tests.

But items that primarily measure differences in students' in-born
intellectual abilities obviously do not contribute to valid inferences about
"how well children have been taught." Would we like all children to do well
on such "native-smarts" items? Of course we would. But to use such items to
arrive at a judgment about educational effectiveness is simply unsound.

Out-of-school learning. The most troubling items on standardized achievement
tests assess what students have learned outside of school. Unfortunately,
you'll find more of these items on standardized achievement tests than you'd
suspect. If children come from advantaged families and stimulus-rich
environments, then they are more apt to succeed on standardized
achievement test items than will other children whose environments don't
mesh as well with what the tests measure. The item in Figure 3 makes clear
what's actually being assessed by a number of items on standardized
achievement tests.

This 6th grade science item first tells students what an attribute of a
fruit is (namely, that it contains seeds). Then the student must identify
what "is not a fruit" by selecting the option without seeds. As any child
who has encountered celery knows, celery is a seed-free plant. The right
answer, then, for those who have coped with celery's strings but never its
seeds, is clearly choice D.

But what if when you were a youngster, your folks didn't have the money to
buy celery at the store? What if your circumstances simply did not give you
the chance to have meaningful interactions with celery stalks by the time
you hit the 6th grade? How well do you think you'd do in correctly answering
the item in Figure 3? And how well would you do if you didn't know that
pumpkins were seed-carrying spheres? Clearly, if children know about
pumpkins and celery, they'll do better on this item than will those children
who know only about apples and oranges. That's how children's socioeconomic
status gets mixed up with children's performances on standardized
achievement tests. The higher your family's socioeconomic status is, the
more likely you are to do well on a number of the test items you'll
encounter in such a test.

Suppose you're a principal of a school in which most students come from
genuinely low socioeconomic situations. How are your students likely to
perform on standardized achievement tests if a substantial number of the
test's items really measure the stimulus-richness of your students'
backgrounds? That's right, your students are not likely to earn very high
scores. Does that mean your school's teachers are doing a poor instructional
job? Of course not.

Conversely, let's imagine you're a principal in an affluent school whose
students tend to have upper-class, well-educated parents. Each spring, your
students' scores on standardized achievement tests are dazzlingly high. Does
this mean your school's teachers are doing a super instructional job? Of
course not.

One of the chief reasons that children's socioeconomic status is so highly
correlated with standardized test scores is that many items on standardized
achievement tests really focus on assessing knowledge and/or skills learned
outside of school--knowledge and/or skills more likely to be learned in some
socioeconomic settings than in others.

Again, you might ask why on earth would standardized achievement test
developers place such items on their tests? As usual, the answer is
consistent with the dominant measurement mission of those tests, namely, to
spread out students' test scores so that accurate and fine-grained
norm-referenced interpretations can be made. Because there is substantial
variation in children's socioeconomic situations, items that reflect such
variations are efficient in producing among-student variations in test
scores.

You've just considered three important factors that can influence students'
scores on standardized achievement tests. One of these factors was directly
linked to educational quality. But two factors weren't.
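
The confounding can be made concrete with a toy calculation. Everything in it is an assumption for illustration: the purely additive model, the `School` class, and every number are hypothetical and are not drawn from any real test or school.

```python
from dataclasses import dataclass


@dataclass
class School:
    name: str
    instruction: float     # points owed to what's taught in school
    native_ability: float  # points owed to students' native ability
    out_of_school: float   # points owed to out-of-school learning

    def observed_score(self):
        # Only this total is visible in a score report; the parts are not.
        return self.instruction + self.native_ability + self.out_of_school


# Made-up numbers: School B teaches better, School A has richer backgrounds.
schools = [
    School("A (affluent, weaker teaching)", 10, 15, 20),
    School("B (low-income, stronger teaching)", 18, 15, 5),
]

by_score = max(schools, key=lambda s: s.observed_score())
by_teaching = max(schools, key=lambda s: s.instruction)

print("Highest observed score:", by_score.name)
print("Strongest instruction: ", by_teaching.name)
```

In this sketch the school with the higher observed score is the one with the weaker instruction, which is exactly the misjudgment that ranking schools by test scores invites.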

What's an Educator to Do?

I've described a situation that, from the perspective of an educator, looks
pretty bleak. What, if anything, can be done? I suggest a three-pronged
attack on the problem. First, I think that you need to learn more about the
viscera of standardized achievement tests. Second, I think that you need to
carry out an effective educational campaign so that your educational
colleagues, parents of children in school, and educational policymakers
understand what the evaluative shortcomings of standardized achievement
tests really are. Finally, I think that you need to arrange a more
appropriate form of assessment-based evidence.

Learning about standardized achievement tests. Far too many educators
haven't really studied the items on standardized achievement tests since the
time that they were, as students, obliged to respond to those items. But the
inferences made on the basis of students' test performances rest on nothing
more than an aggregated sum of students' item-by-item responses. What
educators need to do is to spend some quality time with standardized
achievement tests--scrutinizing the test's items one at a time to see what
they are really measuring.

Spreading the word. Most educators, and almost all parents and school board
members, think that schools should be rated on the basis of their students'
scores on standardized achievement tests. Those people need to be educated.
It is the responsibility of all educators to do that educating.

If you do try to explain to the public, to parents, or to policymakers why
standardized test scores will probably provide a misleading picture of
educational quality, be sure to indicate that you're not running away from
the need to be held accountable. No, you must be willing to identify other,
more credible evidence of student achievement.

Coming up with other evidence. If you're going to argue against standardized
achievement tests as a source of educational evidence for determining school
quality, and you still are willing to be held educationally accountable,
then you'll need to ante up some other form of evidence to show the world
that you really are doing a good educational job.

I recommend that you attempt to assess students' mastery of genuinely
significant cognitive skills, such as their ability to write effective
compositions, their ability to use lessons from history to make cogent
analyses of current problems, and their ability to solve high-level
mathematical problems.

If the skills selected measure really important cognitive outcomes, are seen
by parents and policymakers to be genuinely significant, and can be
addressed instructionally by competent teachers, then the assembly of a set
of pre-test-to-post-test evidence showing substantial student growth in such
skills can be truly persuasive.

What teachers need are assessment instruments that measure worthwhile skills
or significant bodies of knowledge. Then teachers need to show the world
that they can instruct children so that those children make striking
pre-instruction to post-instruction progress.

The fundamental point is this: If educators accept the position that
standardized achievement test scores should not be used to measure the
quality of schooling, then they must provide other, credible evidence that
can be used to ascertain the quality of schooling. Carefully collected,
nonpartisan evidence regarding teachers' pre-test-to-post-test promotion of
undeniably important skills or knowledge just might do the trick.

Right Task, Wrong Tools

Educators should definitely be held accountable. The teaching of a nation's
children is too important to be left unmonitored. But to evaluate
educational quality by using the wrong assessment instruments is a
subversion of good sense. Although educators need to produce valid evidence
regarding their effectiveness, standardized achievement tests are the wrong
tools for the task.

References


Freeman, D. J., Kuhs, T. M., Porter, A. C., Floden, R. E., Schmidt, W. H., &
Schwille, J. R. (1983). Do textbooks and tests define a natural curriculum
in elementary school mathematics? Elementary School Journal, 83(5), 501-513.

Gardner, H. (1994). Multiple intelligences: The theory in practice.
Teachers College Record, 95(4), 576-583.

Author's note: A longer version of this article will appear in the final
chapter of W. James Popham's book Modern Educational Measurement: Practical
Guidelines for Educational Leaders, 3rd ed., (forthcoming); Needham Heights,
MA: Allyn & Bacon.
W. James Popham is a UCLA Emeritus Professor. He may be reached at IOX
Assessment Associates, 5301 Beethoven St., Ste. 190, Los Angeles, CA 90066
