Date: Jan 9, 2013 2:56 AM
Subject: Degrees of freedom (DF) in normality test
I did a simple linear regression (SLR) on 2 equal-length vectors of
data, then subjected the residuals to a Normal Probability Plot (NPP,
http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was
good, and there was no gross concavity, convexity, or "S" shape to
indicate skew or excess kurtosis.
From web browsing, I found that the mean and the standard deviation of
the normal distribution that is being tested for can be estimated by
the y-intercept and the slope of the NPP, respectively. In other words,
a 2nd SLR is performed on the scatter plot of the residuals' NPP.
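
To make the procedure concrete, here is a minimal sketch of how I understand it, in Python rather than Excel. The function name `npp_fit`, the plotting-position constant `a = 0.375`, and the made-up sample are all my own assumptions for illustration; other plotting-position formulas exist and would give a slightly different slope.

```python
from statistics import NormalDist

import numpy as np


def npp_fit(residuals, a=0.375):
    """Fit a line to the normal probability plot of `residuals`.

    Returns (intercept, slope), read as estimates of the mean and the
    standard deviation. `a` is the plotting-position constant; 0.375 is
    an assumed choice, not something fixed by the method itself.
    """
    y = np.sort(np.asarray(residuals, dtype=float))
    n = y.size
    # Approximate plotting positions (i - a) / (n + 1 - 2a), i = 1..n
    p = (np.arange(1, n + 1) - a) / (n + 1 - 2 * a)
    # Approximate means of the normal order statistics
    z = np.array([NormalDist().inv_cdf(pi) for pi in p])
    # Least-squares line through (z, ordered residuals)
    slope, intercept = np.polyfit(z, y, 1)
    return intercept, slope


# Made-up data purely for illustration (N = 16, as in my case)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=16)
sample -= sample.mean()  # residuals from a fit with an intercept average to zero
mean_est, sd_est = npp_fit(sample)
print(mean_est, sd_est)
```

The intercept and slope printed at the end are the NPP-based estimates of the mean and the standard deviation.
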
I am at a loss as to how to resolve a discrepancy. The estimate of the
*standard deviation* of the residuals comes from the slope of the NPP.
This should correspond to the standard error of the estimate from the
SLR on the 2 vectors of data, shouldn't it? It doesn't. There is a
notable discrepancy of about 5% (N=16 data points; yes, I know it's a
small sample). I don't know which is the correct result. I am using
Excel's linear regression function LINEST, but as will become clear,
that's not all that relevant to the problem, since I can read its
documentation to ensure that it conforms with textbook theory.
Part of this discrepancy can be explained by the fact that the formula
for the normal order statistic means is approximate (see the NPP
Wikipedia page), but I suspect that it's not the main culprit because
the estimated *mean* of the residuals was highly accurate (on the
order of 1e-17, ideally zero).
I decided to manually calculate the estimate of the standard deviation
for the residuals. This is simply the sum of squares (SS) of the
residuals (SSres), divided by the DF, then square-rooted.
According to SLR theory, the DF should be N-2 because two parameters
are estimated from the data: one degree is lost in estimating the
slope and another in estimating the intercept. I manually verified
that this is in fact what is done by Excel's SLR.
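
The manual calculation can be sketched as follows. This is only an illustration under my own assumptions: the helper name `residual_sd`, its `ddof` parameter, and the made-up data are hypothetical, and I use NumPy's `polyfit` in place of Excel's LINEST.

```python
import numpy as np


def residual_sd(x, y, ddof):
    """Fit y = b0 + b1*x by least squares; return sqrt(SSres / (N - ddof))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1, b0 = np.polyfit(x, y, 1)   # slope first, then intercept
    resid = y - (b0 + b1 * x)      # residuals of the fitted line
    ss_res = np.sum(resid**2)      # SSres
    return np.sqrt(ss_res / (x.size - ddof))


# Made-up data (N = 16) purely for illustration
rng = np.random.default_rng(1)
x = np.arange(16, dtype=float)
y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=16)

s_n2 = residual_sd(x, y, ddof=2)   # standard error of the estimate, DF = N-2
print(s_n2)
```

Passing a different `ddof` changes only the divisor, so the same helper can be reused to try another choice of DF on the same residuals.
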
The alternative is to look at the NPP problem completely separately
from the SLR problem. This is simply estimating a population standard
deviation from a sample. The sample consists of the residuals from
the SLR problem, but this fact is not used. The DF in such a process is
N-1.
For the estimation of standard deviation for the residuals (not just
for the sample, but for the whole hypothetical population), which DF
is theoretically correct, N-1 or N-2? As a disclaimer, I should say
that using N-1 gives a greater discrepancy from the SLR than even the
NPP yields. So it doesn't really help dispel the discrepancy. Be
that as it may, I'm still interested in what the theoretically
correct choice for DF is.
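
For concreteness, the two candidates I am comparing can be written as below. The subscripted names are my own labels. Both use the same SSres, which is possible because the residuals from a fit with an intercept already average to zero, so no separate mean needs to be subtracted.

```latex
s_{N-2} = \sqrt{\frac{\mathrm{SS}_{\mathrm{res}}}{N-2}}, \qquad
s_{N-1} = \sqrt{\frac{\mathrm{SS}_{\mathrm{res}}}{N-1}}, \qquad
\frac{s_{N-1}}{s_{N-2}} = \sqrt{\frac{N-2}{N-1}} < 1
```

So the N-1 version is always the smaller of the two, and the gap between them shrinks as N grows.
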
P.S. I'm not interested in the Maximum Likelihood approach for the
time being. Better to get a good understanding of the why's for one
approach before broaching another approach.