Gregg Drube asked about The Handbook of Small Data Sets. I have not used it, but Bob Hayden wrote a review of it for the fall 1995 issue of the STN newsletter. I have copied the review (in its straight text format) below. By the way, I called Chapman Hall. The price of the book is now $68.95.
Jerry Moreno, Editor of the STN newsletter
John Carroll University
email@example.com
***************************************************************
Book Review
A Handbook of Small Data Sets by D.J. Hand, F. Daly, A.D. Lunn, K. McConway, and E. Ostrowski, (1994) ISBN 0-412-39920-2, Chapman Hall (800)842-3636 $64.95 U.S.
Real statisticians do not analyze fake data, so the data our students work with should usually be real. Having students collect data of interest to them is a way to increase their motivation. It is also important for students to do some data-gathering because otherwise they will not learn about the first half of a statistical investigation: formulating a question, designing a study, and gathering data. However, student-gathered data may not illustrate the varied and important uses of statistics in society, and they may not always illustrate the points we want to make in class. It also takes a lot of time to gather data. For these reasons, it is often desirable to use real data that have already been gathered by someone else. However, not many of us have a mass of appropriate data of our own to share with our students. There have been a number of attempts to provide statistics teachers with data sets, as part of a textbook, in a supplement to a textbook, or in separate collections. The book under review is my own favorite collection. Unfortunately, it does have some problems, and I feel a need to dwell on those in order to make them less of a problem for you.
One of the things that often discourages people from using technology is the vast collection of difficulties that crop up during a first attempt. This collection is a bit clumsy to use, its index is riddled with errors, and the data disk is so badly scrambled that it would take weeks to straighten it out. Still, my hope is that you will buy this book, because it really is a wonderful collection of data sets. Just be forewarned that there may be some glitches, especially with the data disk. I'll spend some time on the problems with the disk, so you can check for and correct any problems in data sets you might use with your students. With those caveats in mind, let's turn to the book's virtues.
Among the things I like about it are:
there are over 500 data sets;
all data sets are provided on a disk with the book;
the context of the data is usually clearly explained, and meaningful to a layperson;
many of the studies are ones that could easily be replicated in class or as student projects;
there are references to the source (and sometimes to published analyses) of the data;
some (too few) have suggestions on how the data might be used in teaching; and,
a wide range of application areas are included.
Let me mention just one data set by way of an example. At about the time Australia converted to the metric system, a college instructor asked two classes meeting in the same room to estimate the dimensions of the room. One class was asked to do estimates in feet, and the other in meters. The results illustrate the concepts of bias and variability in data: the estimates in meters vary much more than those in feet, and they are not "correct on the average."
On the downside, for too many of the data sets you are left to guess what it is you are supposed to find out from the data. Even when there is some clear purpose to the study, it is often not one of much real importance. The data sets can be used to illustrate statistics, but not always to show its importance. Finally, few of the data sets are random samples from a well-defined population, so most are not suitable for illustrating inference techniques. They are more useful for exploratory data analysis.
All the rest of the things I don't like have to do with the mechanics of locating and using the data sets you might want. They are listed in random order in the book. You will need to turn to a table in the back of the book to discover the name of the computer file containing the data. (There is no systematic naming system.) When you try to read one of those files into your software, you may be in for a surprise. One good use for fake data is to create a very simple illustration. Here is an illustration of the kinds of problems you will find on the data disk. (The problems are not made up, and, yes, all these problems did occur in a single file, and similar problems can be found in most of the other files!) Made-up data on pianists might appear in the book in a table like this.
For each pianist, we have measurements on two variables. In addition, the asterisks denote female pianists. On the disk, the data might look like this:
23 51 32 52
23 44 12 33
43 45
The names and sexes of the pianists have been lost. A less obvious problem is that most statistical packages will interpret this data file as having four measurements on each of three subjects -- except that two measurements appear to be missing for the last subject. This may cause an error message or it may cause the package to refuse the data. Even if it does not, and you ask for the mean of the first variable, you will get the mean of three numbers, not five. If you give data files like this to your students, you will need to drastically increase your life insurance coverage.
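The misreading described above can be reproduced in a few lines. This is only a sketch in Python, assuming the made-up file is broken into lines of four, four, and two numbers (which is what "four measurements on each of three subjects" implies); the names `DISK_TEXT` and `read_as_rows` are invented for the illustration and do not come from the book or its disk.

```python
# Made-up file contents, assumed to be laid out four numbers per line
# with only two numbers on the last line.
DISK_TEXT = """23 51 32 52
23 44 12 33
43 45
"""


def read_as_rows(text):
    """Parse whitespace-delimited text the naive way: one line per subject."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


rows = read_as_rows(DISK_TEXT)
row_lengths = [len(r) for r in rows]  # the last "subject" is two values short

# What a package sees as the first variable: one value per line.
first_column = [r[0] for r in rows]
naive_mean = sum(first_column) / len(first_column)

# The real first variable is every other number across the whole file,
# because each pianist contributes a pair of measurements.
flat = [x for r in rows for x in r]
true_first = flat[0::2]
true_mean = sum(true_first) / len(true_first)

print("values per line:", row_lengths)
print("naive mean uses", len(first_column), "values:", naive_mean)
print("true mean uses", len(true_first), "values:", true_mean)
```

The naive mean is taken over three numbers and the true mean over five, so the two disagree even though no error message appears.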
To add the missing information, you could try typing in 0's and 1's to represent male and female respectively, but when you do, you may discover (or worse, you may not!) that Haskil and Rubinstein have been switched on the disk. Assuming you are content to keep the order on the disk, the data file would look like this after you finish editing it.
1 23 51
0 23 44
1 43 45
1 32 52
0 12 33
This is a lot of work! Perhaps those of us who use the book can share cleaned-up versions of the data files, and/or convince the publisher to do some cleaning. Still, the book is great for browsing, and it's great to have the data on disk in any form!
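Some of that work might be scripted. The sketch below is in Python and rests on two assumptions that are not stated in the review: that the made-up disk file holds two side-by-side blocks of measurement pairs read down the page (the order of the edited file suggests this), and that the sex codes still have to be typed in by hand from the printed table. The names `DISK_TEXT`, `SEX_CODES`, and `disk_to_records` are hypothetical.

```python
# Made-up disk contents, assumed layout: two blocks of (var1, var2) pairs
# printed side by side, with the second block one row shorter.
DISK_TEXT = """23 51 32 52
23 44 12 33
43 45
"""

# Sex codes entered by hand in disk order (0 = male, 1 = female),
# copied from the edited file shown in the review.
SEX_CODES = [1, 0, 1, 1, 0]


def disk_to_records(text, pair_width=2):
    """Return one tuple of measurements per subject, reading each block top to bottom."""
    rows = [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]
    n_blocks = max(len(r) for r in rows) // pair_width
    records = []
    for b in range(n_blocks):
        for r in rows:
            chunk = r[b * pair_width:(b + 1) * pair_width]
            if len(chunk) == pair_width:  # skip rows that end before this block
                records.append(tuple(chunk))
    return records


records = disk_to_records(DISK_TEXT)
if len(records) != len(SEX_CODES):
    raise ValueError("number of sex codes does not match number of subjects")

# One line per pianist: sex code followed by the two measurements.
cleaned = [(sex, a, b) for sex, (a, b) in zip(SEX_CODES, records)]
for row in cleaned:
    print(*row)
```

A script like this only restores the one-subject-per-line shape. It cannot detect that two pianists were switched, so each file would still need checking against the book.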
I could not find any clue on the disk or the book's cover as to what kind of computer might be able to read the disk, but it looked like a 720k DOS disk to my PC clone. I doubt you could read this with an 800k Mac drive; I'm not sure about the higher density Mac drives. The files take up about 500 times the smallest chunk of disk space you can allocate -- about 0.5 MB on the floppy provided, about 4 MB on my 340 MB hard disk. That's a lot of space. I have heard tales of some systems going off to never-never land when asked to convert the more than 500 DOS files to Mac files. Since most of the files are unusable in their current state anyway, it is probably best to work with just one at a time, converting, editing, moving to your hard disk, and importing to your stats package as needed.
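The space figures follow from simple arithmetic: each small file occupies at least one allocation unit, so roughly 500 files cost roughly 500 units. The Python sketch below assumes allocation units of 1 KB on the floppy and 8 KB on the hard disk. Those two sizes are inferred from the totals quoted above and are not stated in the review; the function name is invented for the illustration.

```python
N_FILES = 500             # "more than 500" files, rounded down
FLOPPY_CLUSTER_KB = 1     # assumption: allocation unit on the 720k floppy
HARD_DISK_CLUSTER_KB = 8  # assumption: allocation unit on the 340 MB hard disk


def min_footprint_mb(n_files, cluster_kb):
    """Least space used if every file fits in a single allocation unit."""
    return n_files * cluster_kb / 1024


floppy_mb = min_footprint_mb(N_FILES, FLOPPY_CLUSTER_KB)
hard_disk_mb = min_footprint_mb(N_FILES, HARD_DISK_CLUSTER_KB)

print(f"floppy:    about {floppy_mb:.2f} MB")
print(f"hard disk: about {hard_disk_mb:.2f} MB")
```

Under those assumptions the same files take about eight times as much room on the hard disk, purely because of the larger allocation unit.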
Despite the problems, I still think it is a great book. The publishers could rectify the problems by cleaning up the data sets on the disk and adding to the disk a proofread (I found too many errors) and corrected version of the data index in an electronic form that one could sort and search. (I wrote them about this at the time I agreed to write this review, and had not heard from them when the deadline arrived ten weeks later.) Even if you threw out the disk and index and only used 20 of the given data sets (typing them in yourself), the book would be worthwhile.
I list but two titles. Singer and Willett give an extensive bibliography of data sources. The other papers in the collection are worth reading as well. The book by Chatterjee et al. is listed because it is too recent to have been included in Singer and Willett. It is at a considerably higher level than the book reviewed above, but has the virtues of including extended analyses of many of the data sets and more examples suited to inferential techniques.
Chatterjee, S., Handcock, M.S., and Simonoff, J.S. (1995), A Casebook for a First Course in Statistics and Data Analysis, John Wiley and Sons, New York, ISBN 0-471-11030-2.
Singer, J.D., and Willett, J.B. (1992), "Annotated Bibliography of Sources of Real-World Datasets Useful for Teaching Applied Statistics", in Gordon, F., and Gordon, S., Eds., Statistics for the Twenty-First Century, Mathematical Association of America, 1529 18th St., NW, Washington, DC 20036, ISBN 0-88385-078-8.
Reviewed by Robert W. Hayden
Plymouth State College
Plymouth, New Hampshire
firstname.lastname@example.org