Does a multiple regression with all dummy (indicator) variables make sense?

I work at a state university tutoring various basic subjects, including college algebra, first-semester calculus, and a two-semester "Statistics for Business and Economics" sequence. In recent years my students have been taught that an alternative to ANOVA is to run a multiple regression in which every explanatory variable is a dummy variable.

A recent example, given as a study guide for the final exam, compared used-car prices by color (white, black, blue, or silver). Both ANOVA and a multiple regression (with black as the excluded category) reject the null hypothesis that there is no difference in prices by color. But the students are then told that the multiple regression gives more information, since we can conclude from the t-tests on the individual coefficients that silver cars sell for more than the base case (black).

I thought you needed at least one measured (scalar?) variable among the explanatory variables. It makes no sense to do a scatter plot on just a dummy variable, so what on earth is this "line" (or surface) you are getting from the regression?
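To make the setup concrete, here is a minimal sketch of the kind of regression described above, written with NumPy. The prices are made up for illustration (they are not the study-guide data), and the names `prices`, `dummy_regression`, `coefs`, and `means` are my own. It builds a design matrix with an intercept plus one 0/1 column per non-black color and fits it by ordinary least squares. In this sketch, at least, the intercept comes out equal to the mean price of the black cars, and each coefficient equals the difference between that color's mean and the black mean, so the fitted "surface" appears to be nothing more than the four group means.

```python
import numpy as np

# Made-up used-car prices (in dollars) by color; purely illustrative.
prices = {
    "black":  [9000.0, 9500.0, 10000.0],
    "white":  [9800.0, 10200.0, 10600.0],
    "blue":   [9100.0, 9700.0, 10300.0],
    "silver": [11000.0, 11500.0, 12000.0],
}

def dummy_regression(groups, base):
    """OLS of price on an intercept plus one 0/1 dummy per non-base color."""
    others = [c for c in groups if c != base]
    order = [base] + others
    # Stack all prices into one response vector, with a matching label vector.
    y = np.concatenate([np.asarray(groups[c]) for c in order])
    labels = np.concatenate([[c] * len(groups[c]) for c in order])
    # Column of ones (intercept), then one indicator column per non-base color.
    columns = [np.ones(len(y))] + [(labels == c).astype(float) for c in others]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept"] + others, coef))

coefs = dummy_regression(prices, base="black")
means = {c: float(np.mean(v)) for c, v in prices.items()}

# Compare each fitted coefficient with the corresponding difference in means.
print("intercept:", coefs["intercept"], "| black mean:", means["black"])
for color in ("white", "blue", "silver"):
    print(color, "coef:", coefs[color], "| mean difference:", means[color] - means["black"])
```

This only reproduces the point estimates; it says nothing about whether the t-tests on the individual coefficients are justified, which is the part I am unsure about.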
So, is having at least one measured explanatory variable a basic requirement for regression? Has anyone proven that the individual coefficients in an all-dummy-variable regression have no meaning? Perhaps they follow a well-defined distribution, which might not be Student's t. Are there any accessible online sources? I did not see anything in the basic article on regression on Wikipedia.
I'll mention that students were previously taught that, according to the Central Limit Theorem, if you are doing hypothesis testing on a mean and you have more than 30 or 40 data points, it's OK to assume your test statistic is normally distributed rather than t-distributed. They've since abandoned that nonsense, but I'm sceptical about these all-dummy regressions as well.