Topic: Handbook of Small Data Sets
Posted: Sep 5, 1996 11:12 AM

Gregg Drube asked about The Handbook of Small Data Sets.
I have not used it, but Bob Hayden wrote a review of it
for the fall 1995 issue of the STN newsletter. I have
copied the review (in its straight text format) below.
By the way, I called Chapman & Hall. The price of the book
is now $68.95.

Jerry Moreno, Editor of the STN newsletter
John Carroll University

Book Review................

A Handbook of Small Data Sets
by D.J. Hand, F. Daly, A.D. Lunn,
K. McConway, and E. Ostrowski,
(1994) ISBN 0-412-39920-2,
Chapman & Hall (800)842-3636 $64.95 U.S.

Real statisticians do not analyze fake data, so the data our
students work with should usually be real. Having students
collect data of interest to them is a way to increase their
motivation. It is also important for students to do some
data-gathering because otherwise they will not learn about the
first half of a statistical investigation: formulating a
question, designing a study, and gathering data. However,
student-gathered data may not illustrate the varied and important
uses of statistics in society, and it may not always illustrate
the points we want to make in class. It also takes a lot of time
to gather data. For that reason, it is often desirable to use
real data that have already been gathered by someone else.
However, not many of us have a mass of appropriate data of our
own to share with our students. There have been a number of
attempts to provide statistics teachers with data sets, as part
of a textbook, in a supplement to a textbook, or in separate
collections. The book under review is my own favorite
collection. Unfortunately, it does have some problems, and I
feel a need to dwell on those in order to make them less of a
problem for you.

One of the things that often discourages people from using
technology is the vast collection of difficulties that crop up
during your first attempt. This collection is a bit clumsy to
use, it has an error-prone index, and the data disk is so badly
scrambled that it would take weeks to straighten it out. Still,
my hope is that you will buy this book, because it really is a
wonderful collection of data sets. Just be forewarned that there
may be some glitches, especially with the data disk. I'll spend
some time on the problems with the disk, so you can check for and
correct any problems in data sets you might use with your
students. With those caveats in mind, let's turn to the book's
strengths.

Among the things I like about it are:

there are over 500 data sets;

all data sets are provided on a disk with the book;

the context of the data is usually clearly explained,
and meaningful to a layperson;

many of the studies are ones that could easily be
replicated in class or as student projects;

there are references to the source (and sometimes to
published analyses) of the data;

some (too few) have suggestions on how the data
might be used in teaching; and,

a wide range of application areas are included.

Let me mention just one data set by way of an example. At about
the time Australia converted to the metric system, a college
instructor asked two classes meeting in the same room to estimate
the dimensions of the room. One class was asked to do
estimates in feet, and the other in meters. The results
illustrate the concepts of bias and variability in data: the
estimates in meters vary much more than those in feet, and they
are not "correct on the average."

On the down side, for too many of the data sets you are left to
guess what it is you are supposed to find out from the data. Even
when there is some clear purpose to the study, it is often not
one of much real importance. The data sets can be used to
illustrate statistics, but not always to show its importance.
Finally, few of the data sets are random samples from a
well-defined population, and so are not suitable for illustrating
inference techniques. They are more useful for exploratory data

All the rest of the things I don't like have to do with the
mechanics of locating and using the data sets you might want.
The data sets are listed in random order in the book. You will need to
turn to a table in the back of the book to discover the name of
the computer file containing the data. (There is no systematic
naming system.) When you try to read one of those files into
your software, you may be in for a surprise. One good use for
fake data is to create a very simple illustration. Here is an
illustration of the kinds of problems you will find on the data
disk. (The problems are not made up, and, yes, all these
problems did occur in a single file, and similar problems can be
found in most of the other files!) Made-up data on pianists
might appear in the book in a table like this.

*Bachauer   23 51     Richter     32 52
*Haskil     12 33     Rubinstein  23 44
 Lipatti    43 45

For each pianist, we have measurements on two variables. In
addition, the asterisks denote female pianists. On the disk, the
data might look like this:

23 51 32 52
23 44 12 33
43 45

The names and sexes of the pianists have been lost. A less
obvious problem is that most statistical packages will interpret
this data file as having four measurements on each of three
subjects -- except that two measurements appear to be missing for
the last subject. This may cause an error message or it may cause
the package to refuse the data. Even if it does not, and you ask
for the mean of the first variable, you will get the mean of
three numbers, not five. If you give data files like this to
your students, you will need to drastically increase your life
insurance coverage.
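
To make the misreading concrete, here is a minimal Python sketch using the made-up pianist numbers above. It assumes the file is whitespace-separated and that each line holds one or two (x, y) records side by side; the function names are illustrative and do not come from any statistics package.

```python
from statistics import mean

# The made-up disk file from the example above.
DISK_TEXT = """\
23 51 32 52
23 44 12 33
43 45
"""

def naive_first_column(text):
    """Read the file the way a package would: one subject per line."""
    return [int(line.split()[0]) for line in text.splitlines() if line.strip()]

def unstack_pairs(text, width=2):
    """Treat each line as side-by-side records of `width` values each."""
    records = []
    for line in text.splitlines():
        values = [int(tok) for tok in line.split()]
        for i in range(0, len(values), width):
            records.append(tuple(values[i:i + width]))
    return records

naive = naive_first_column(DISK_TEXT)
pairs = unstack_pairs(DISK_TEXT)
print("naive reading:", len(naive), "values, mean", round(mean(naive), 2))
print("unstacked:", len(pairs), "values, mean", mean(x for x, _ in pairs))
```

The naive reading averages only the three numbers that happen to sit in the first column, while the unstacked reading recovers all five subjects.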

To add the missing information, you could try typing in 0's and
1's to represent male and female respectively, but when you do,
you may discover (or worse, you may not!) that Haskil and
Rubinstein have been switched on the disk. Assuming you are
content to keep the order on the disk, the data file would look
like this after you finish editing it.

1 23 51
0 23 44
0 43 45
0 32 52
1 12 33
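
The editing step can also be sketched in Python. This is only an illustration under stated assumptions: the printed table is typed in by hand as `BOOK`, the disk file is stacked block by block (the whole left block, then the right), and each record is matched to the printed table by its values, which works here only because every (x, y) pair in the toy table is unique. The names `stack_blocks` and `add_sex_codes` are hypothetical.

```python
# Printed table, transcribed by hand: (name, is_female, x, y).
# An asterisk in the book marks a female pianist.
BOOK = [
    ("Bachauer", True, 23, 51),
    ("Richter", False, 32, 52),
    ("Haskil", True, 12, 33),
    ("Rubinstein", False, 23, 44),
    ("Lipatti", False, 43, 45),
]

# The made-up disk file, one string per line.
DISK_LINES = ["23 51 32 52", "23 44 12 33", "43 45"]

def stack_blocks(lines, width=2):
    """Return records block by block: the whole left block, then the right."""
    rows = [[int(tok) for tok in line.split()] for line in lines]
    n_blocks = max(len(r) for r in rows) // width
    records = []
    for b in range(n_blocks):
        for r in rows:
            chunk = r[b * width:(b + 1) * width]
            if len(chunk) == width:  # skip short lines such as the last one
                records.append(tuple(chunk))
    return records

def add_sex_codes(records, book):
    """Prefix each record with 1 (female) or 0 (male), matched on its values."""
    lookup = {(x, y): int(female) for _, female, x, y in book}
    return [(lookup[rec],) + rec for rec in records]

for row in add_sex_codes(stack_blocks(DISK_LINES), BOOK):
    print(*row)
```

Matching on values rather than on position is what guards against the switched rows; with real data, where pairs may repeat, you would have to check the order by eye against the book.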

This is a lot of work! Perhaps those of us who use the book can
share cleaned up versions of the data files, and/or convince the
publisher to do some cleaning. Still, the book is great for
browsing, and it's great to have the data on disk in any form!

I could not find any clue on the disk or the book's cover as to what
kind of computer might be able to read the disk, but it looked
like a 720k DOS disk to my PC clone. I doubt you could read this
with an 800k Mac drive; I'm not sure about the higher density Mac
drives. The files take up about 500 times the smallest chunk of
disk space you can allocate -- about 0.5 MB on the floppy
provided, about 4 MB on my 340 MB hard disk. That's a lot of
space. I have heard tales of some systems going off to
never-never land when asked to convert the more than 500 DOS
files to Mac files. Since most of the files are unusable in
their current state anyway, it is probably best to work with just
one at a time, converting, editing, moving to your hard disk, and
importing to your stats package as needed.
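
The space figures follow from simple allocation arithmetic: every file, however small, occupies at least one allocation unit. The short Python sketch below assumes a 1 KB unit on the 720k floppy and an 8 KB unit on the 340 MB hard disk. Those sizes are typical for DOS disks of that era but are not stated in the review, and the sketch further assumes about 500 files that each fit in a single unit.

```python
def minimum_footprint_kb(n_files, cluster_kb):
    """Smallest possible disk usage when every file needs at least one unit."""
    return n_files * cluster_kb

N_FILES = 500  # "about 500" files, per the review

# Allocation-unit sizes are assumptions (typical DOS values), not from the review.
for label, cluster_kb in [("720k floppy", 1), ("340 MB hard disk", 8)]:
    kb = minimum_footprint_kb(N_FILES, cluster_kb)
    # Rough conversion to megabytes, good enough for an "about" figure.
    print(f"{label}: {kb} KB, about {kb / 1000:.1f} MB")
```

Under those assumptions the same 500 small files need roughly eight times as much room on the hard disk as on the floppy, which matches the 0.5 MB and 4 MB figures quoted.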

Despite the problems, I still think it is a great book. The
publishers could rectify the problems by cleaning up the data
sets on the disk and adding to the disk a proofread (I found too
many errors) and corrected version of the data index in an
electronic form that one could sort and search. (I wrote to them
about this at the time I agreed to write this review, and had
not heard from them when the deadline arrived ten weeks later.)
Even if you threw out the disk and index and only used 20 of the
given data sets (typing them in yourself), the book would be


I list but two titles. Singer and Willett give an extensive
bibliography of data sources. The other papers in the collection
are worth reading as well. The book by Chatterjee et al. is
listed because it is too recent to have been included in Singer
and Willett. It is at a considerably higher level than the book
reviewed above, but has the virtues of including extended analyses
of many of the data sets and more examples suited to inferential

Chatterjee, S., Handcock, M.S., Simonoff, J.S. (1995), A
Casebook for a First Course in Statistics and Data Analysis,
John Wiley and Sons, New York, ISBN 0-471-11030-2.

Singer, J.D., and Willett, J.B. (1992), "Annotated Bibliography
of Sources of Real-World Datasets Useful for Teaching Applied
Statistics", in Gordon, F., and Gordon, S., Eds., (1992),
Statistics for the Twenty-First Century, Mathematical
Association of America, 1529 18th St., NW, Washington, DC 20036,
ISBN 0-88385-078-8.

Reviewed by
Robert W. Hayden
Plymouth State College
Plymouth, New Hampshire

