"Paul" <email@example.com> wrote in message news:firstname.lastname@example.org... On Jan 9, 5:34 am, "David Jones" <dajh...@hotmail.co.uk> wrote: > "Ray Koopman" wrote in message > > news:email@example.com... > > On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote: > > > > > > > I did a simple linear regression (SLR) on 2 equal-length vectors of > > data, then subjected the residuals to a Normal Probability Plot > > (NPP,http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was > > good, and there was no gross concavity, convexity, or "S" shape to > > indicate skew or excess kurtosis. > > > From web browsing, I found that the mean and the standard deviation of > > the normal distribution that is being tested for can be estimated by > > the y-intercept and the slope of the NPP. In other words, a 2nd SLR > > is performed on the residuals NPP scatter graph. > > > I am at a loss as to how to resolve a discrepancy. The estimate of > > *standard deviation* of the residuals comes from slope of the NPP. > > This should correspond to the standard error of the estimate from the > > SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a > > notable error of about 5% (N=16 data points, yes I know it's a small > > sample). I don't know which is the correct result. I am using > > Excel's linear regression LINEST, but as will be clear, that's not all > > that relevant to the problem since I can read their documentation to > > ensure that they conform with textbook theory. > > > Part of this discrepancy can be explained by the fact that the formula > > for the normal order statistic means is approximate (see the NPP > > wikipedia page), but I suspect that it's not the main culprit because > > the estimated *mean* of the residuals was highly accurate (in the > > order of 1e-17, ideally zero). > > > I decided to manually calculate the estimate of the standard deviation > > for residuals. 
> > This is simply the sum of the squares (SS) of the
> > residuals (SSres), normalized by the DF, then square-rooted.
> > According to the SLR theory, the DF should be N-2 because one degree
> > is lost in getting the mean of the independent variable, and another is
> > lost in getting the mean of the dependent variable. I manually
> > verified that this is in fact what is done by Excel's SLR.
> >
> > The alternative is to look at the NPP problem completely separately
> > from the SLR problem. This is simply estimating a population standard
> > deviation from a sample. The sample consists of the residuals from
> > the SLR problem, but this fact is not used. The DF in such a process is
> > N-1.
> >
> > For the estimation of standard deviation for the residuals (not just
> > for the sample, but for the whole hypothetical population), which DF
> > is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> > that using N-1 gives a greater discrepancy from the SLR than even the
> > NPP yields. So it doesn't really help dispel the discrepancy. Be
> > that as it may, however, I'm still interested in what is the
> > theoretically correct choice for DF.
> >
> > P.S. I'm not interested in the Maximum Likelihood approach for the
> > time being. Better to get a good understanding of the why's for one
> > approach before broaching another approach.
>
> Even if the true regression is linear and the error random
> variables are iid normal, the sample residuals are not iid.
> Their joint distribution is n-variate normal with zero means
> and covariance matrix = (I - H)*sigma^2, where sigma^2 is the
> variance of the error distribution and H is the "hat matrix"
> associated with the predictors. The expected order statistics
> of the residuals are not a simple linear function of the
> expected order statistics of n iid normals.
>
> ------------------------------------------------------------------------
>
> A way of understanding/improving the NPP approach in a non-regression
> context is to look at L-Estimation ... the use of linear combinations of
> order statistics to estimate the parameters of distributions. The slope of
> the NPP line is simply one particular linear combination of order
> statistics. There is some theory to specify "optimal" estimates. As above,
> the theory for this would not be easily transferred for practical use in
> the more complicated regression-residual case.

Thank you both for your explanations. I do understand the statement that the residuals of the 1st SLR are not IID even though the error distributions are. To fully understand the underlying reasons, I think I need to delve much more into the theory. For the time being, I take this as meaning that using SLR on the NPP is quite the approximation. Can I also take it as meaning that the answer to choosing df=N-1 or df=N-2 is not a simple answer (and may not even make sense), or is there actually a rationale for choosing the lesser of the 2 evils?

There are many things going on here. At the most basic level, and in the non-regression case, you might want to consider that the "aim" of the slope estimate in the Normal Probability Plot procedure is to provide an unbiased estimate of the standard deviation (not the variance). In L-estimation you get a minimum-variance (among linear combinations of order statistics) unbiased estimate of the standard deviation. In ordinary normal-theory estimation, you get an unbiased estimate of the variance (and the degrees of freedom adjustment leads to the unbiasedness), but not a minimum-variance one.
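
For anyone who wants to check these statements numerically, here is a rough Monte Carlo sketch in Python (NumPy plus the standard library). It is only an illustration under stated assumptions, not what Excel's LINEST does: the function name `simulate_residual_sd`, the equally spaced predictor values, sigma = 2, and the use of Blom's plotting positions for the NPP are all my own choices. It relies on the fact quoted earlier in the thread that the residuals have covariance (I - H)*sigma^2, whose trace in SLR is (N-2)*sigma^2, and it averages SSres/(N-2), SSres/(N-1), the square root of SSres/(N-2), and the NPP slope over many simulated data sets.

```python
import numpy as np
from statistics import NormalDist


def simulate_residual_sd(n=16, sigma=2.0, reps=20000, seed=0):
    """Monte Carlo comparison of SD/variance estimates built from SLR residuals."""
    rng = np.random.default_rng(seed)

    # Hypothetical fixed predictor values; any non-constant x would do.
    x = np.linspace(0.0, 1.0, n)
    X = np.column_stack([np.ones(n), x])      # intercept and slope columns
    H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
    M = np.eye(n) - H                         # residuals = M @ errors

    # One simulated error vector per row; M is symmetric, so eps @ M gives residuals.
    errors = rng.normal(0.0, sigma, size=(reps, n))
    resid = errors @ M
    ss_res = np.sum(resid ** 2, axis=1)

    # Blom's approximation to the expected normal order statistics
    # (an assumption; other plotting positions give slightly different slopes).
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    z = np.array([NormalDist().inv_cdf(pi) for pi in p])
    zc = z - z.mean()

    # Least-squares slope of the sorted residuals regressed on z, per replication.
    npp_slope = np.sort(resid, axis=1) @ zc / np.sum(zc ** 2)

    return {
        "sigma": sigma,
        "trace_I_minus_H": float(np.trace(M)),
        "mean_var_df_n_minus_2": float(np.mean(ss_res / (n - 2))),
        "mean_var_df_n_minus_1": float(np.mean(ss_res / (n - 1))),
        "mean_sd_df_n_minus_2": float(np.mean(np.sqrt(ss_res / (n - 2)))),
        "mean_npp_slope": float(np.mean(npp_slope)),
    }


if __name__ == "__main__":
    for name, value in simulate_residual_sd().items():
        print(f"{name}: {value:.4f}")
```

With these settings, the average of SSres/(N-2) should land close to sigma^2, the N-1 version should come out smaller, and the square root of SSres/(N-2) should average a little below sigma. That last point is the distinction between an unbiased estimate of the variance and an unbiased estimate of the standard deviation. The NPP slope is printed for comparison only; I make no claim about its exact expected value for regression residuals.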