Topic: Degrees of freedom (DF) in normality test
Replies: 4   Last Post: Jan 10, 2013 6:45 AM

 Ray Koopman Posts: 3,383 Registered: 12/7/04
Re: Degrees of freedom (DF) in normality test
Posted: Jan 9, 2013 3:31 AM

On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
> I did a simple linear regression (SLR) on 2 equal-length vectors of
> data, then subjected the residuals to a Normal Probability Plot (NPP,
> http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was
> good, and there was no gross concavity, convexity, or "S" shape to
> indicate skew or excess kurtosis.
>
> From web browsing, I found that the mean and the standard deviation of
> the normal distribution that is being tested for can be estimated by
> the y-intercept and the slope of the NPP. In other words, a 2nd SLR
> is performed on the residuals NPP scatter graph.
>
> I am at a loss as to how to resolve a discrepancy. The estimate of
> *standard deviation* of the residuals comes from slope of the NPP.
> This should correspond to the standard error of the estimate from the
> SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a
> notable error of about 5% (N=16 data points, yes I know it's a small
> sample). I don't know which is the correct result. I am using
> Excel's linear regression LINEST, but as will be clear, that's not all
> that relevant to the problem since I can read their documentation to
> ensure that they conform with textbook theory.
>
> Part of this discrepancy can be explained by the fact that the formula
> for the normal order statistic means is approximate (see the NPP
> wikipedia page), but I suspect that it's not the main culprit because
> the estimated *mean* of the residuals was highly accurate (on the
> order of 1e-17, ideally zero).
>
> I decided to manually calculate the estimate of the standard deviation
> for residuals. This is simply the sum of squares (SS) of the
> residuals (SSres), divided by the DF, then square-rooted.
> According to the SLR theory, the DF should be N-2 because one degree
> is lost in getting the mean of the independent variable, and another
> is lost in getting the mean of the dependent variable. I manually
> verified that this is in fact what is done by Excel's SLR.
>
> The alternative is to look at the NPP problem completely separately
> from the SLR problem. This is simply estimating a population standard
> deviation from a sample. The sample consists of the residuals from
> the SLR problem, but this fact is not used. The DF in such a process
> is N-1.
>
> For the estimation of standard deviation for the residuals (not just
> for the sample, but for the whole hypothetical population), which DF
> is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> that using N-1 gives a greater discrepancy from the SLR than even the
> NPP yields. So it doesn't really help dispel the discrepancy. Be
> that as it may, however, I'm still interested in what is the
> theoretically correct choice for DF.
>
> P.S. I'm not interested in the Maximum Likelihood approach for the
> time being. Better to get a good understanding of the why's for one
> approach before broaching another approach.
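
The three candidate estimates described in the quoted post can be put side by side with a small sketch. Everything here is an assumption made for illustration: the data are simulated (N=16, not Paul's data), the names fit_line and sd_estimates are hypothetical, and the order statistic means use Blom's approximation, since the post does not say which approximate formula was used.

```python
import math
import random
from statistics import NormalDist, mean

def fit_line(x, y):
    """Least-squares (slope, intercept) for y regressed on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ybar - slope * xbar

def sd_estimates(x, y):
    """Return the three residual SD estimates discussed in the post."""
    n = len(x)
    b, a = fit_line(x, y)
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in res)
    # Approximate means of normal order statistics (Blom's formula;
    # an assumption, other approximations give slightly different slopes).
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Second regression: sorted residuals against the normal scores.
    npp_slope, npp_intercept = fit_line(z, sorted(res))
    return {
        "n_minus_2": math.sqrt(ss_res / (n - 2)),   # SLR standard error
        "n_minus_1": math.sqrt(ss_res / (n - 1)),   # plain sample SD
        "npp_slope": npp_slope,                     # SD estimate from NPP
        "npp_intercept": npp_intercept,             # mean estimate from NPP
    }

random.seed(0)  # hypothetical simulated data, true error SD = 1
x = [float(i) for i in range(16)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
est = sd_estimates(x, y)
for name, value in est.items():
    print(f"{name:>14s}: {value:.6g}")
```

The N-1 figure is always the smaller of the two divisor-based estimates, by the fixed factor sqrt((N-2)/(N-1)), and the NPP intercept lands at essentially zero because least-squares residuals sum to zero and the normal scores are symmetric about zero.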

Even if the true regression is linear and the error random
variables are iid normal, the sample residuals are not iid.
Their joint distribution is n-variate normal with zero means
and covariance matrix = (I - H)*sigma^2, where sigma^2 is the
variance of the error distribution and H is the "hat matrix"
associated with the predictors. The expected order statistics
of the residuals are not a simple linear function of the
expected order statistics of n iid normals.
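
A minimal sketch of the matrix I - H for a straight-line fit with an intercept may make this concrete. The predictor values (0 through 15) and the function name residual_cov_factor are hypothetical, chosen only to match N=16; the closed form h_ij = 1/n + (x_i - xbar)(x_j - xbar)/Sxx is the standard hat matrix entry for simple linear regression.

```python
from statistics import mean

def residual_cov_factor(x):
    """Return I - H (as nested lists) for a straight-line fit on x."""
    n = len(x)
    xbar = mean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [
        [
            (1.0 if i == j else 0.0)
            - (1.0 / n + (x[i] - xbar) * (x[j] - xbar) / sxx)
            for j in range(n)
        ]
        for i in range(n)
    ]

x = [float(i) for i in range(16)]  # hypothetical predictor values
m = residual_cov_factor(x)
diag = [m[i][i] for i in range(len(x))]
trace = sum(diag)

# Residual variances (in units of sigma^2) differ from point to point.
print("smallest and largest variance factor:", min(diag), max(diag))
# Off-diagonal entries are nonzero, so the residuals are correlated.
print("covariance factor between residuals 0 and 1:", m[0][1])
# The trace counts the degrees of freedom left after fitting.
print("trace of I - H:", trace)
```

The diagonal entries are unequal and the off-diagonal entries are nonzero, so the residuals are neither identically distributed nor independent. The trace comes out to n - 2, and since E[SSres] = sigma^2 * trace(I - H), dividing SSres by N-2 is what makes the estimate of sigma^2 unbiased.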

Date Subject Author
1/9/13 Paul
1/9/13 Ray Koopman
1/9/13 David Jones
1/9/13 Paul
1/10/13 David Jones