"Ray Koopman" wrote in message news:firstname.lastname@example.org...
On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
> I did a simple linear regression (SLR) on 2 equal-length vectors of
> data, then subjected the residuals to a Normal Probability Plot
> (NPP, http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was
> good, and there was no gross concavity, convexity, or "S" shape to
> indicate skew or excess kurtosis.
>
> From web browsing, I found that the mean and the standard deviation of
> the normal distribution that is being tested for can be estimated by
> the y-intercept and the slope of the NPP. In other words, a 2nd SLR
> is performed on the residuals NPP scatter graph.
>
> I am at a loss as to how to resolve a discrepancy. The estimate of
> *standard deviation* of the residuals comes from slope of the NPP.
> This should correspond to the standard error of the estimate from the
> SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a
> notable error of about 5% (N=16 data points, yes I know it's a small
> sample). I don't know which is the correct result. I am using
> Excel's linear regression LINEST, but as will be clear, that's not all
> that relevant to the problem since I can read their documentation to
> ensure that they conform with textbook theory.
>
> Part of this discrepancy can be explained by the fact that the formula
> for the normal order statistic means is approximate (see the NPP
> wikipedia page), but I suspect that it's not the main culprit because
> the estimated *mean* of the residuals was highly accurate (in the
> order of 1e-17, ideally zero).
>
> I decided to manually calculate the estimate of the standard deviation
> for residuals. This is simply the sum of the square (SS) of the
> residuals (SSres), normalized by the DF, then square-rooted.
> According to the SLR theory, the DF should be N-2 because one degree
> is in getting the mean of the independent variable, and another is
> lost in getting the mean of the dependent variable. I manually
> verified that this is in fact what is done by Excel's SLR.
>
> The alternative is to look at the NPP problem completely separately
> from SLR problem. This is simply estimating a population standard
> deviation from a sample. The sample consists for the residuals from
> the SLR problem, this fact is not used. The DF in such a process is
> N-1.
>
> For the estimation of standard deviation for the residuals (not just
> for the sample, but for the whole hypothetical population), which DF
> is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> that using N-1 gives a greater discrepancy from the SLR than even the
> NPP yields. So it doesn't really help dispel the discrepancy. Be
> that as it may, however, I'm still interested in what is the
> theoretically correct choice for DF.
>
> P.S. I'm not interested in the Maximum Likelihood approach for the
> time being. Better to get a good understanding of the why's for one
> approach before broaching another approach.
Even if the true regression is linear and the error random variables are iid normal, the sample residuals are not iid. Their joint distribution is n-variate normal with zero means and covariance matrix = (I - H)*sigma^2, where sigma^2 is the variance of the error distribution and H is the "hat matrix" associated with the predictors. The expected order statistics of the residuals are not a simple linear function of the expected order statistics of n iid normals.
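To make the (I - H)*sigma^2 point concrete, here is a rough numerical sketch in Python with NumPy. The predictor values (1..16) are made up purely for illustration and are not the original poster's data; `hat_matrix` is a hypothetical helper name. It builds H for a straight-line fit with intercept and checks that I - H has non-zero off-diagonal entries (the residuals are correlated), unequal diagonal entries (the residuals have unequal variances), and trace N-2. Since E[SSres] = sigma^2 * trace(I - H), that trace is where the N-2 divisor comes from.

```python
import numpy as np

def hat_matrix(x):
    """Hat matrix H = X (X'X)^-1 X' for a straight-line fit with intercept."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    return X @ np.linalg.solve(X.T @ X, X.T)

# Assumed, illustrative predictor values (N = 16); not the poster's data.
x = np.arange(1.0, 17.0)
n = x.size

H = hat_matrix(x)
M = np.eye(n) - H                # Cov(residuals) = M * sigma^2

resid_df = np.trace(M)           # trace(I - H) = N - 2 for a line with intercept
leverages = np.diag(H)           # Var(e_i) = (1 - h_ii) * sigma^2, not constant
offdiag_max = np.abs(M - np.diag(np.diag(M))).max()  # > 0: residuals correlated

print("residual df:", round(float(resid_df), 6))
print("leverage range:", leverages.min(), leverages.max())
print("largest |off-diagonal| of I - H:", offdiag_max)
```

Under these assumptions, dividing SSres by N-2 gives an unbiased estimate of sigma^2, whereas treating the residuals as N iid draws (the N-1 route, or the NPP slope) ignores the structure in I - H.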
A way of understanding/improving the NPP approach in a non-regression context is to look at L-estimation: the use of linear combinations of order statistics to estimate the parameters of distributions. The slope of the NPP line is simply one particular linear combination of the order statistics. There is some theory that specifies "optimal" estimates. As noted above, this theory would not transfer easily to practical use in the more complicated regression-residual case.
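As a small illustration of the "linear combination of order statistics" remark, the Python sketch below writes the NPP slope explicitly as a weighted sum of the sorted sample. It assumes Blom's approximation (i - 0.375)/(N + 0.25) for the plotting positions, which may not be the formula the original poster used, and it uses simulated iid normal data rather than regression residuals. The function names are made up for this example.

```python
import numpy as np
from statistics import NormalDist

def blom_scores(n):
    """Approximate expected normal order statistics (Blom's formula, assumed)."""
    i = np.arange(1, n + 1)
    p = (i - 0.375) / (n + 0.25)
    return np.array([NormalDist().inv_cdf(float(q)) for q in p])

def npp_slope_weights(n):
    """Weights w_i such that the least-squares NPP slope is sum(w_i * x_(i))."""
    m = blom_scores(n)
    d = m - m.mean()
    return d / np.sum(d ** 2)

# Simulated iid sample, for illustration only (N = 16, true sigma = 2).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=16)
ordered = np.sort(sample)

w = npp_slope_weights(sample.size)
sigma_npp = float(w @ ordered)   # L-estimate of sigma: weighted sum of order stats

# Cross-check against an ordinary least-squares fit of the NPP scatter.
slope_check = np.polyfit(blom_scores(sample.size), ordered, 1)[0]

print("NPP slope as L-estimate:", sigma_npp)
print("NPP slope via polyfit:  ", slope_check)
print("sample sd (N-1):        ", sample.std(ddof=1))
```

The weights sum to zero, so the estimate does not change if a constant is added to the data. They are just the least-squares weights, though, not the "optimal" ones the theory would give.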