Date: Dec 11, 2012 1:20 PM
Subject: Multiple regression with all dummy variables
Does a multiple regression with all dummy (indicator) variables make
sense? I work at a state university tutoring various basic subjects
including college algebra, first semester calculus, and a two-semester
"Statistics for Business and Economics" sequence. In recent years my
students have been taught that an alternative to using the ANOVA
technique is to run a multiple regression analysis using all dummy
variables. A recent example given as a study guide for the final exam
was a comparison of used-car prices by color (white, black, blue, or
silver). Both ANOVA and a multiple regression (with black as the
excluded category) reject the null hypothesis that there is no
difference in prices by color. But the students are then told that the
multiple regression gives more information since we can conclude from
the t-tests on individual coefficients that silver cars sell for more
than the base case (black). I thought you needed at least one measured
(scalar?) variable among the explanatory variables -- it makes no
sense to do a scatter plot on just a dummy variable, so what on earth
is this "line" (or surface) you are getting from the regression?
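To make the setup concrete, here is a minimal sketch of the kind of all-dummy regression described above. The prices are invented purely for illustration (they are not the study-guide data), and the variable names are my own. It fits ordinary least squares with black as the excluded category, then compares the coefficients with the group means and the regression F statistic with the one-way ANOVA F statistic.

```python
import numpy as np

# Hypothetical used-car prices in $1000s, invented purely for illustration.
prices = {
    "black": [10.0, 12.0, 11.0],
    "white": [11.0, 13.0, 12.0],
    "blue": [9.0, 11.0, 10.0],
    "silver": [14.0, 15.0, 16.0],
}
base = "black"  # excluded (reference) category, as in the study guide
others = [c for c in prices if c != base]

labels = [c for c, vals in prices.items() for _ in vals]
y = np.array([p for vals in prices.values() for p in vals])

# Design matrix: an intercept column plus one 0/1 dummy per non-base color.
dummies = [[1.0 if lab == c else 0.0 for lab in labels] for c in others]
X = np.column_stack([np.ones(len(y))] + dummies)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

group_means = {c: float(np.mean(vals)) for c, vals in prices.items()}

# Regression F statistic for "all dummy coefficients are zero".
n, k = len(y), len(others)
sse = float(np.sum((y - X @ beta) ** 2))
sst = float(np.sum((y - y.mean()) ** 2))
f_reg = ((sst - sse) / k) / (sse / (n - k - 1))

# One-way ANOVA F statistic computed directly from the group means.
ssb = sum(len(v) * (group_means[c] - y.mean()) ** 2 for c, v in prices.items())
ssw = sum(float(np.sum((np.array(v) - group_means[c]) ** 2)) for c, v in prices.items())
f_anova = (ssb / k) / (ssw / (n - k - 1))

print("intercept:", beta[0], "black mean:", group_means[base])
for c, b in zip(others, beta[1:]):
    print(c, "coefficient:", b, "mean difference:", group_means[c] - group_means[base])
print("F (regression):", f_reg, "F (ANOVA):", f_anova)
```

In this sketch the intercept comes out as the mean price of the black cars and each dummy coefficient as that color's mean minus the black mean, so the fitted "surface" is just the four group means rather than a line through a scatter plot. The two F statistics agree as well.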
So, is having at least one measured explanatory variable a basic
requirement for regression? Has anyone proven that the individual
coefficients in an all-dummy-variable regression have no meaning?
Perhaps they follow a well-defined distribution, which might not be
Student's t. Are there any easy online sources? I did not see anything
in the basic article on regression in Wikipedia.
I'll mention that previously students were taught that, according to
the Central Limit Theorem, if you are doing hypothesis testing on a
mean and you have more than 30 or 40 data points, it's OK to assume
your test statistic is normally rather than t-distributed. They've
abandoned that nonsense, but I'm sceptical about these all-dummy
regressions.
Thanks for any help!