On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
> I did a simple linear regression (SLR) on 2 equal-length vectors of
> data, then subjected the residuals to a Normal Probability Plot (NPP,
> http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was
> good, and there was no gross concavity, convexity, or "S" shape to
> indicate skew or excess kurtosis.
>
> From web browsing, I found that the mean and the standard deviation of
> the normal distribution that is being tested for can be estimated by
> the y-intercept and the slope of the NPP. In other words, a 2nd SLR
> is performed on the residuals NPP scatter graph.
>
> I am at a loss as to how to resolve a discrepancy. The estimate of
> *standard deviation* of the residuals comes from the slope of the NPP.
> This should correspond to the standard error of the estimate from the
> SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a
> notable error of about 5% (N=16 data points, yes I know it's a small
> sample). I don't know which is the correct result. I am using
> Excel's linear regression LINEST, but as will be clear, that's not all
> that relevant to the problem since I can read their documentation to
> ensure that they conform with textbook theory.
>
> Part of this discrepancy can be explained by the fact that the formula
> for the normal order statistic means is approximate (see the NPP
> wikipedia page), but I suspect that it's not the main culprit because
> the estimated *mean* of the residuals was highly accurate (on the
> order of 1e-17, ideally zero).
>
> I decided to manually calculate the estimate of the standard deviation
> for residuals. This is simply the sum of squares (SS) of the
> residuals (SSres), normalized by the DF, then square-rooted.
> According to the SLR theory, the DF should be N-2 because one degree
> is lost in getting the mean of the independent variable, and another is
> lost in getting the mean of the dependent variable. I manually
> verified that this is in fact what is done by Excel's SLR.
>
> The alternative is to look at the NPP problem completely separately
> from the SLR problem. This is simply estimating a population standard
> deviation from a sample. The sample consists of the residuals from
> the SLR problem, but this fact is not used. The DF in such a process is
> N-1.
>
> For the estimation of standard deviation for the residuals (not just
> for the sample, but for the whole hypothetical population), which DF
> is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> that using N-1 gives a greater discrepancy from the SLR than even the
> NPP yields. So it doesn't really help dispel the discrepancy. Be
> that as it may, however, I'm still interested in what is the
> theoretically correct choice for DF.
>
> P.S. I'm not interested in the Maximum Likelihood approach for the
> time being. Better to get a good understanding of the why's for one
> approach before broaching another approach.

Even if the true regression is linear and the error random variables are iid normal, the sample residuals are not iid. Their joint distribution is n-variate normal with zero means and covariance matrix = (I - H)*sigma^2, where sigma^2 is the variance of the error distribution and H is the "hat matrix" associated with the predictors. The expected order statistics of the residuals are not a simple linear function of the expected order statistics of n iid normals.
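
As a rough illustration of the point above, here is a minimal Python/NumPy sketch. The predictor values, the seed, sigma = 1, and the number of replications are all made up for illustration; none of them come from the original data. It builds H for a straight-line fit with n = 16, checks that I - H has unequal diagonal entries and nonzero off-diagonal entries (so the residuals are neither identically distributed nor independent), and runs a small simulation comparing SSres/(n-2) and SSres/(n-1) as estimates of sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed (assumption)
n = 16                                    # sample size mentioned in the question
sigma = 1.0                               # assumed true error SD (illustrative)
x = rng.uniform(0.0, 10.0, size=n)        # hypothetical predictor values
X = np.column_stack([np.ones(n), x])      # design matrix: intercept and slope

# Hat matrix H = X (X'X)^{-1} X'; residuals are e = (I - H) * errors
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                         # Cov(e) = sigma^2 * M

diag_M = np.diag(M)                       # Var(e_i) / sigma^2 = 1 - h_ii
diag_spread = diag_M.max() - diag_M.min() # > 0: variances are unequal
max_off_diag = np.abs(M - np.diag(diag_M)).max()  # > 0: residuals correlated
trace_M = np.trace(M)                     # equals n - 2 for this design

# Monte Carlo check of the two candidate divisors for SSres
reps = 20000                              # illustrative number of replications
eps = rng.normal(0.0, sigma, size=(reps, n))
resid = eps @ M                           # M is symmetric, so each row is M @ eps_i
ss_res = (resid ** 2).sum(axis=1)
mean_var_nm2 = (ss_res / (n - 2)).mean()  # average of SSres / (n - 2)
mean_var_nm1 = (ss_res / (n - 1)).mean()  # average of SSres / (n - 1)

print("trace(I - H)        :", trace_M)
print("spread of 1 - h_ii  :", diag_spread)
print("largest |off-diag|  :", max_off_diag)
print("mean SSres/(n-2)    :", mean_var_nm2)
print("mean SSres/(n-1)    :", mean_var_nm1)
```

Since trace(I - H) = n - 2 when the model has an intercept and a slope, E[SSres] = (n - 2)*sigma^2, so dividing by n - 2 gives an unbiased estimate of sigma^2, while dividing by n - 1 averages about (n - 2)/(n - 1) of it.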