Date: Jan 9, 2013 2:56 AM
Author: Paul
Subject: Degrees of freedom (DF) in normality test

I did a simple linear regression (SLR) on 2 equal-length vectors of

data, then subjected the residuals to a Normal Probability Plot (NPP,

http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was

good, and there was no gross concavity, convexity, or "S" shape to

indicate skew or excess kurtosis.

From web browsing, I found that the mean and the standard deviation of

the normal distribution that is being tested for can be estimated by

the y-intercept and the slope of the NPP. In other words, a 2nd SLR

is performed on the residuals' NPP scatter plot.
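
A minimal sketch of that second regression, for illustration only. It assumes the approximate normal order statistic means given on the NPP Wikipedia page, z_i = Phi^-1((i - a)/(n + 1 - 2a)) with a = 3/8 for n <= 10 and a = 0.5 otherwise. The function names (`rankit_scores`, `fit_line`, `npp_estimates`) and the sample values are made up for this example and are not the data from the post.

```python
from statistics import NormalDist, mean


def rankit_scores(n):
    """Approximate normal order statistic means for a sample of size n."""
    a = 3.0 / 8.0 if n <= 10 else 0.5
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (intercept, slope)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def npp_estimates(sample):
    """Regress the sorted sample on the normal scores.

    The intercept estimates the mean and the slope estimates
    the standard deviation of the underlying normal distribution.
    """
    z = rankit_scores(len(sample))
    return fit_line(z, sorted(sample))


# Hypothetical residuals, not the data from the post.
residuals = [-0.9, 0.4, 0.1, -0.3, 1.2, -0.6, 0.2, -0.1]
mu_hat, sigma_hat = npp_estimates(residuals)
print(mu_hat, sigma_hat)
```

Note that this slope comes from a least-squares fit to the plot, so it has no N-1 or N-2 divisor of its own.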

I am at a loss as to how to resolve a discrepancy. The estimate of

*standard deviation* of the residuals comes from the slope of the NPP.

This should correspond to the standard error of the estimate from the

SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a

notable error of about 5% (N=16 data points, yes I know it's a small

sample). I don't know which is the correct result. I am using

Excel's linear regression LINEST, but as will be clear, that's not all

that relevant to the problem since I can read its documentation to

ensure that it conforms with textbook theory.

Part of this discrepancy can be explained by the fact that the formula

for the normal order statistic means is approximate (see the NPP

wikipedia page), but I suspect that it's not the main culprit because

the estimated *mean* of the residuals was highly accurate (on the

order of 1e-17, ideally zero).

I decided to manually calculate the estimate of the standard deviation

for the residuals. This is simply the sum of squares (SS) of the

residuals (SSres), normalized by the DF, then square-rooted.

According to the SLR theory, the DF should be N-2 because one degree

is lost in estimating the intercept, and another is

lost in estimating the slope. I manually

verified that this is in fact what is done by Excel's SLR.
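
The manual calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`slr_residuals`, `std_error_of_estimate`) and the four data points are hypothetical, and the N-2 divisor is the one the post attributes to SLR theory and to Excel.

```python
from math import sqrt
from statistics import mean


def slr_residuals(x, y):
    """Fit y = intercept + slope * x by least squares and return the residuals."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def std_error_of_estimate(residuals):
    """sqrt(SSres / (N - 2)): the residual sum of squares normalized by N-2."""
    n = len(residuals)
    ss_res = sum(r * r for r in residuals)
    return sqrt(ss_res / (n - 2))


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
res = slr_residuals(x, y)
s = std_error_of_estimate(res)
print(s)
```

For these made-up points the fitted line is y = 0.5 + 0.8x, SSres is 1.8, and the result is sqrt(1.8/2).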

The alternative is to look at the NPP problem completely separately

from SLR problem. This is simply estimating a population standard

deviation from a sample. The sample consists of the residuals from

the SLR problem, but this fact is not used. The DF in such a process is

N-1.
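
To put a number on how far apart the two divisors are, here is a small sketch. Since both estimates use the same SSres, they differ only by the factor sqrt((N-2)/(N-1)), whatever the data. The function name `sd_from_ss` and the placeholder SSres value are assumptions made for this example.

```python
from math import sqrt


def sd_from_ss(ss_res, df):
    """Standard deviation estimate from a sum of squares and a chosen DF."""
    return sqrt(ss_res / df)


n = 16          # sample size mentioned in the post
ss_res = 1.0    # arbitrary placeholder; it cancels out of the ratio

s_n2 = sd_from_ss(ss_res, n - 2)  # SLR convention
s_n1 = sd_from_ss(ss_res, n - 1)  # plain sample standard deviation
ratio = s_n1 / s_n2               # equals sqrt((n - 2) / (n - 1))
print(f"{ratio:.4f}")
```

For N=16 the ratio is roughly 0.966, so the N-1 estimate is always the smaller of the two.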

For the estimation of standard deviation for the residuals (not just

for the sample, but for the whole hypothetical population), which DF

is theoretically correct, N-1 or N-2? As a disclaimer, I should say

that using N-1 gives a greater discrepancy from the SLR than even the

NPP yields. So it doesn't really help dispel the discrepancy. Be

that as it may, however, I'm still interested in what is the

theoretically correct choice for DF.

P.S. I'm not interested in the Maximum Likelihood approach for the

time being. Better to get a good understanding of the why's for one

approach before broaching another approach.