
Re: Test constantness of normality of residuals from linear regression
Posted: Jan 10, 2013 2:02 PM


On Jan 10, 6:59 am, Paul <paul.domaskis@gmail.com> wrote:
> On Jan 10, 12:48 am, Ray Koopman <koopman@sfu.ca> wrote:
>> On Jan 9, 7:35 pm, Paul <paul.domaskis@gmail.com> wrote:
>>> After much browsing of Wikipedia and the web, I used both normal
>>> probability plot and Anderson-Darling to test the normality of
>>> residuals from a simple linear regression (SLR) of 6 data points.
>>> Results were very good. However, SLR doesn't just assume that the
>>> residuals are normal. It assumes that the standard deviation of the
>>> PDF that gives rise to the residuals is constant along the horizontal
>>> axis. Is there a way to test for this if none of the data points
>>> have the same value for the independent variable? I want to be able
>>> to show that there are no gross curves or spreading/focusing of the
>>> scatter.
>>>
>>> In electrical engineering signal theory, the horizontal axis is time.
>>> Using Fourier Transform (FT), time-frequency domains can show trends.
>>> Intuitively, I would set up the data as a scatter graph of residuals
>>> plotted against the independent variable (which would be treated as
>>> time). Gross curves show up as low-frequency content. There should
>>> be none if residuals are truly iid. The spectrum should look like
>>> white noise. The usual way to get the power spectrum is the FT of
>>> the autocorrelation function, which itself should resemble an impulse
>>> at zero. This just shows independence of samples, not constant iid
>>> normal along the horizontal axis.
>>>
>>> As for spreading or narrowing of the scatter, I guess that can be
>>> modelled in time as a multiplication of a truly random signal by a
>>> linear (or exponential) attenuation function. The latter acts like
>>> a modulation envelope. Their power spectrums will then convolve in
>>> some weird way. I'm not sure if this is a fruitful direction for
>>> identifying trends in the residuals. It starts to get convoluted
>>> pretty quickly.
>>>
>>> Surely there must be a less klugy way from the world of statistics?
>>> I realize that my sample size will probably be too small for many
>>> conceptual approaches. For example, if I had a wealth of data points,
>>> I could segment the horizontal axis, then do a normality test on each
>>> segment. This would generate mu's and sigma's as well, which could
>>> then be compared across segments. So for the sake of conceptual
>>> gratification, I'm hoping for a more elegant test for the ideal case
>>> of many data points. If there is also a test for small sample sizes,
>>> so much the better (though I don't hold my breath).
>>
>> If yx = a + b*x + e, where the errors are iid random variables with
>> zero means, and you do an ordinary least squares fit of that model to
>> (x1,y1), ..., (xn,yn), then the theoretical variance of the residual
>> for xi is 1 - 1/n - [(xi-m)^2 / sum{(xj-m)^2}], ...

That's incomplete. That whole expression needs to be multiplied by the variance of the error distribution.

>> . . . . . . . . . . . . . . . . . . . . . . . . where m is the mean
>> of x1, ..., xn. In words, residuals whose x is far from the mean tend
>> to be smaller than those whose x is near the mean. (This is known as
>> "leverage": points far from the mean have more "leverage" on the
>> regression line, pulling it closer to them.) Note that normality is
>> not required
>
> Ray,
>
> Thanks for the background. One of the 4 explicit assumptions of
> regression is that the PDF for the random errors is normal, according
> to Introductory Statistics by Prem S. Mann (3rd edition). Is this not
> correct? This is the reason I am learning about normality tests, and
> especially about the constantness of the PDF along the horizontal axis.

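For anyone who wants to check the residual-variance expression quoted above (including the factor of the error variance), here is a minimal sketch in plain Python. The design points `xs`, the value of `sigma`, the true line used in the simulation, and all function names are made up for illustration; none of them come from the thread. It compares sigma^2 * (1 - 1/n - (xi-m)^2 / sum{(xj-m)^2}) with a Monte Carlo estimate of each residual's variance.

```python
import random


def residual_variance_factors(xs):
    """Return 1 - 1/n - (x_i - m)^2 / sum_j (x_j - m)^2 for each x_i."""
    n = len(xs)
    m = sum(xs) / n
    sxx = sum((x - m) ** 2 for x in xs)
    return [1 - 1 / n - (x - m) ** 2 / sxx for x in xs]


def ols_residuals(xs, ys):
    """Residuals from an ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def simulated_residual_variances(xs, sigma, reps=20000, seed=0):
    """Monte Carlo estimate of the variance of the residual at each x_i."""
    rng = random.Random(seed)
    sums_sq = [0.0] * len(xs)
    for _ in range(reps):
        # The true line (a=1, b=2) is arbitrary: the residual variances
        # do not depend on it, only on the x values and sigma.
        ys = [1.0 + 2.0 * x + rng.gauss(0.0, sigma) for x in xs]
        for i, r in enumerate(ols_residuals(xs, ys)):
            sums_sq[i] += r * r
    # Residuals have mean zero, so the mean square estimates the variance.
    return [s / reps for s in sums_sq]


xs = [1, 2, 3, 4, 5, 6]  # hypothetical design with 6 distinct x values
sigma = 1.5              # hypothetical error standard deviation
theory = [sigma ** 2 * f for f in residual_variance_factors(xs)]
simulated = simulated_residual_variances(xs, sigma)
for x, t, s in zip(xs, theory, simulated):
    print(f"x={x}: theory={t:.3f}  simulated={s:.3f}")
```

The end points should show the smallest variances and the middle points the largest, which is the leverage effect described above. The factors also sum to n - 2, the residual degrees of freedom, which is a quick sanity check. The simulation draws normal errors only for convenience; the formula itself needs just iid errors with zero mean.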
It all depends on what you want. Look up the Gauss-Markov theorem. To justify the usual OLS estimates of the regression coefficients, the errors need only to be unbiased, uncorrelated, and homoscedastic, but to justify all the usual p-values and confidence regions, the errors must be iid normal.
However, that's considering only the theoretical justification. In practice, what matters is not whether the assumptions are right or wrong, but how wrong they are; they're never exactly right.
Normality is probably the least important assumption. The most important things to worry about are the general form of the model and whether it includes all the relevant predictor variables. Then you ask how correlated and/or heteroscedastic the errors might be. Finally, you might wonder about shapes of the error distributions. Minor departures from normality are inconsequential. Nothing in the real world is exactly normal, and any test of normality will reject if the sample size is big enough.
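
The last point, that any normality test will reject once the sample is big enough, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses the Jarque-Bera statistic (not the Anderson-Darling test mentioned earlier in the thread) because its large-sample p-value has a closed form, and the mildly skewed distribution, the constant `c`, and the function names are all invented for the example. The chi-square approximation behind the p-value is rough at small n, so the small-sample numbers are indicative only.

```python
import math
import random


def jarque_bera_pvalue(data):
    """Approximate p-value of the Jarque-Bera normality test."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Under normality JB is approximately chi-square with 2 degrees of
    # freedom, whose upper tail probability is exp(-x/2).
    return math.exp(-jb / 2.0)


def slightly_skewed_sample(n, rng, c=0.05):
    """Draw n values of z + c*(z^2 - 1): nearly normal, mildly skewed."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        out.append(z + c * (z * z - 1.0))
    return out


rng = random.Random(1)
for n in (50, 500, 5000, 100000):
    p = jarque_bera_pvalue(slightly_skewed_sample(n, rng))
    print(f"n={n:>6}: p-value = {p:.3g}")
```

The distribution is the same at every sample size; only the test's ability to detect the small skew changes. At the largest n the p-value should be vanishingly small even though the departure from normality is of the minor kind called inconsequential above.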

