On Jan 9, 5:34 am, "David Jones" <dajh...@hotmail.co.uk> wrote:
> "Ray Koopman" wrote in message
> news:firstname.lastname@example.org...
>
> On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
> > I did a simple linear regression (SLR) on 2 equal-length vectors of
> > data, then subjected the residuals to a Normal Probability Plot
> > (NPP, http://en.wikipedia.org/wiki/Normal_probability_plot). The fit
> > was good, and there was no gross concavity, convexity, or "S" shape
> > to indicate skew or excess kurtosis.
> >
> > From web browsing, I found that the mean and the standard deviation
> > of the normal distribution that is being tested for can be estimated
> > by the y-intercept and the slope of the NPP. In other words, a 2nd
> > SLR is performed on the residuals NPP scatter graph.
> >
> > I am at a loss as to how to resolve a discrepancy. The estimate of
> > the *standard deviation* of the residuals comes from the slope of
> > the NPP. This should correspond to the standard error of the
> > estimate from the SLR on the 2 vectors of data. Shouldn't it?? It
> > doesn't. There is a notable difference of about 5% (N=16 data
> > points, yes I know it's a small sample). I don't know which is the
> > correct result. I am using Excel's linear regression LINEST, but as
> > will be clear, that's not all that relevant to the problem since I
> > can read their documentation to ensure that they conform with
> > textbook theory.
> >
> > Part of this discrepancy can be explained by the fact that the
> > formula for the normal order statistic means is approximate (see the
> > NPP wikipedia page), but I suspect that it's not the main culprit
> > because the estimated *mean* of the residuals was highly accurate
> > (on the order of 1e-17, ideally zero).
> >
> > I decided to manually calculate the estimate of the standard
> > deviation for residuals. This is simply the sum of squares (SS) of
> > the residuals (SSres), normalized by the DF, then square-rooted.
> > According to the SLR theory, the DF should be N-2 because one degree
> > of freedom is lost in estimating the intercept and another in
> > estimating the slope. I manually verified that this is in fact what
> > is done by Excel's SLR.
> >
> > The alternative is to look at the NPP problem completely separately
> > from the SLR problem. This is simply estimating a population
> > standard deviation from a sample. The sample consists of the
> > residuals from the SLR problem, but this fact is not used. The DF in
> > such a process is N-1.
> >
> > For the estimation of standard deviation for the residuals (not just
> > for the sample, but for the whole hypothetical population), which DF
> > is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> > that using N-1 gives a greater discrepancy from the SLR than even
> > the NPP yields. So it doesn't really help dispel the discrepancy. Be
> > that as it may, however, I'm still interested in what is the
> > theoretically correct choice for DF.
> >
> > P.S. I'm not interested in the Maximum Likelihood approach for the
> > time being. Better to get a good understanding of the why's for one
> > approach before broaching another approach.
>
> Even if the true regression is linear and the error random
> variables are iid normal, the sample residuals are not iid.
> Their joint distribution is n-variate normal with zero means
> and covariance matrix = (I - H)*sigma^2, where sigma^2 is the
> variance of the error distribution and H is the "hat matrix"
> associated with the predictors. The expected order statistics
> of the residuals are not a simple linear function of the
> expected order statistics of n iid normals.
>
> ------------------------------------------------------------------------
>
> A way of understanding/improving the NPP approach in a non-regression
> context is to look at L-estimation, the use of linear combinations of
> order statistics to estimate the parameters of distributions. The
> slope of the NPP line is simply one particular linear combination of
> order statistics. There is some theory to specify "optimal" estimates.
> As above, the theory for this would not be easily transferred for
> practical use in the more complicated regression-residual case.

Thank you both for your explanations. I do understand the statement that the residuals of the 1st SLR are not IID even though the errors themselves are. To fully understand the underlying reasons, I think I need to delve much deeper into the theory. For the time being, I take this to mean that running an SLR on the NPP is only a rough approximation.

Can I also take it to mean that the choice between df=N-1 and df=N-2 has no simple answer (and may not even make sense), or is there actually a rationale for choosing the lesser of the 2 evils?
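
One way to probe the df question numerically, offered as a rough sketch rather than a definitive answer: simulate many data sets with a known error SD and compare the three estimates discussed above. Everything specific here is an assumption made for illustration: the function names (`slr`, `sd_estimates`, `simulate`), the true line y = 2 + 0.5x, the x values 0..N-1, and the Blom-type plotting positions (i - 0.375)/(N + 0.25) standing in for the approximate normal order-statistic means. It is plain Python, not Excel's LINEST.

```python
import math
import random
from statistics import NormalDist, fmean


def slr(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def sd_estimates(x, y):
    """Return (s with df=N-2, s with df=N-1, NPP slope) for one data set."""
    n = len(x)
    a, b = slr(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in res)
    # Assumed approximation to the normal order-statistic means
    # (Blom-type plotting positions; other variants exist).
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Second SLR: sorted residuals regressed on the normal scores.
    _, npp_slope = slr(z, sorted(res))
    return math.sqrt(ss_res / (n - 2)), math.sqrt(ss_res / (n - 1)), npp_slope


def simulate(n=16, sigma=1.0, trials=2000, seed=0):
    """Average the squared estimates (variance scale) over many data sets."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n)]  # hypothetical predictor values
    means = [0.0, 0.0, 0.0]
    for _ in range(trials):
        # Hypothetical true line with iid normal errors of SD sigma.
        y = [2.0 + 0.5 * xi + rng.gauss(0.0, sigma) for xi in x]
        for k, s in enumerate(sd_estimates(x, y)):
            means[k] += s * s / trials
    return means


if __name__ == "__main__":
    v_n2, v_n1, v_npp = simulate()
    print("mean s^2, df=N-2:", v_n2)
    print("mean s^2, df=N-1:", v_n1)
    print("mean (NPP slope)^2:", v_npp)
```

Since the residual covariance is (I - H)*sigma^2 and the trace of I - H is N-2 for a line with intercept, the expected value of SSres is (N-2)*sigma^2. The df=N-2 average should therefore land near sigma^2, and the df=N-1 average is smaller by exactly the factor (N-2)/(N-1). The NPP-slope column is only there for comparison; the sketch makes no claim about how it should behave.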