Date: Jan 9, 2013 11:26 AM
Author: Paul
Subject: Re: Degrees of freedom (DF) in normality test

On Jan 9, 5:34 am, "David Jones" <dajh...@hotmail.co.uk> wrote:
> "Ray Koopman" wrote in message
>
> On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
>
> > I did a simple linear regression (SLR) on 2 equal-length vectors of
> > data, then subjected the residuals to a Normal Probability Plot
> > (NPP, http://en.wikipedia.org/wiki/Normal_probability_plot).  The fit was
> > good, and there was no gross concavity, convexity, or "S" shape to
> > indicate skew or excess kurtosis.
> >
> > From web browsing, I found that the mean and the standard deviation of
> > the normal distribution that is being tested for can be estimated by
> > the y-intercept and the slope of the NPP.  In other words, a 2nd SLR
> > is performed on the NPP scatter plot of the residuals.
> >
> > I am at a loss as to how to resolve a discrepancy.  The estimate of
> > the *standard deviation* of the residuals comes from the slope of the
> > NPP.  This should correspond to the standard error of the estimate
> > from the SLR on the 2 vectors of data.  Shouldn't it?  It doesn't.
> > There is a notable difference of about 5% (N=16 data points, yes I
> > know it's a small sample).  I don't know which is the correct result.
> > I am using Excel's linear regression LINEST, but as will be clear,
> > that's not all that relevant to the problem since I can read its
> > documentation to ensure that it conforms with textbook theory.
> >
> > Part of this discrepancy can be explained by the fact that the formula
> > for the normal order statistic means is approximate (see the NPP
> > wikipedia page), but I suspect that it's not the main culprit because
> > the estimated *mean* of the residuals was highly accurate (on the
> > order of 1e-17, ideally zero).
> >
> > I decided to manually calculate the estimate of the standard deviation
> > for the residuals.  This is simply the sum of squares (SS) of the
> > residuals (SSres), divided by the DF, then square-rooted.
> > According to the SLR theory, the DF should be N-2 because one degree
> > is lost in estimating the intercept and another in estimating the
> > slope.  I manually verified that this is in fact what is done by
> > Excel's SLR.
> >
> > The alternative is to look at the NPP problem completely separately
> > from SLR problem.  This is simply estimating a population standard
> > deviation from a sample.  The sample consists of the residuals from
> > the SLR problem, but this fact is not used.  The DF in such a process
> > is N-1.
> >
> > For the estimation of standard deviation for the residuals (not just
> > for the sample, but for the whole hypothetical population), which DF
> > is theoretically correct, N-1 or N-2?  As a disclaimer, I should say
> > that using N-1 gives a greater discrepancy from the SLR than even the
> > NPP yields.  So it doesn't really help dispel the discrepancy.  Be
> > that as it may, however, I'm still interested in what is the
> > theoretically correct choice for DF.
> >
> > P.S.  I'm not interested in the Maximum Likelihood approach for the
> > time being.  Better to get a good understanding of the why's for one
> > approach before broaching another approach.
>
> Even if the true regression is linear and the error random
> variables are iid normal, the sample residuals are not iid.
> Their joint distribution is n-variate normal with zero means
> and covariance matrix = (I - H)*sigma^2, where sigma^2 is the
> variance of the error distribution and H is the "hat matrix"
> associated with the predictors. The expected order statistics
> of the residuals are not a simple linear function of the
> expected order statistics of n iid normals.
>
> ------------------------------------------------------------------------
>
> A way of understanding/improving the NPP approach in a non-regression
> context is to look at L-Estimation ... the use of linear combinations of
> order statistics to estimate the parameters of distributions.  The slope of
> the NPP line is simply one particular linear combination of order
> statistics. There is some theory to specify "optimal" estimates. As above,
> the theory for this would not be easily transferred for practical use in the
> more complicated regression-residual case.

Ray, David,

Thank you both for your explanations. I do understand the statement
that the residuals of the 1st SLR are not IID even though the errors
are. To fully understand the underlying reasons, I think I need to
delve much more into the theory. For the time being, I take this as
meaning that using SLR on the NPP is a fairly rough approximation.
Can I also take it as meaning that the choice between df=N-1 and
df=N-2 has no simple answer (and may not even make sense), or is
there actually a rationale for choosing the lesser of the 2 evils?
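
For concreteness, here is a minimal sketch of the three estimates I am
comparing, written in Python (standard library only) rather than Excel.
The data are simulated, not my actual 16 points, and the function
names, the seed, and the plotting-position constant a=0.5 are my own
assumptions for illustration. It also computes trace(I - H) from the
leverages, which by Ray's covariance expression equals
E[SSres]/sigma^2.

```python
import math
import random
from statistics import NormalDist


def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def leverages(x):
    """Diagonal of the hat matrix H for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


def sigma_estimates(x, y, a=0.5):
    """Compare SSres-based and NPP-based estimates of sigma.

    `a` is the plotting-position constant in (i - a)/(n + 1 - 2a);
    a=0.5 is an assumed choice, not necessarily the one behind my
    spreadsheet numbers.
    """
    n = len(x)
    b0, b1 = fit_line(x, y)
    res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in res)

    # Approximate normal scores for the ordered residuals.
    nd = NormalDist()
    z = [nd.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]

    # 2nd SLR: sorted residuals regressed on the normal scores.
    npp_intercept, npp_slope = fit_line(z, sorted(res))

    return {
        "df_n_minus_2": math.sqrt(ss_res / (n - 2)),
        "df_n_minus_1": math.sqrt(ss_res / (n - 1)),
        "npp_slope": npp_slope,
        "npp_intercept": npp_intercept,
        # trace(I - H) = n - sum of leverages
        "trace_I_minus_H": n - sum(leverages(x)),
    }


# Simulated data only: a straight line plus iid normal errors.
random.seed(0)
N = 16
x = [float(i) for i in range(N)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 2.0) for xi in x]

est = sigma_estimates(x, y)
for name, value in est.items():
    print(f"{name}: {value:.4f}")
```

Whatever the data, df_n_minus_1 comes out smaller than df_n_minus_2 by
the fixed factor sqrt((N-2)/(N-1)), and the NPP intercept is zero up to
rounding because the residuals sum to zero and the normal scores are
symmetric. Only the NPP slope depends on the choice of plotting
positions.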