Views expressed in these public forums are not endorsed by
NCTM or The Math Forum.



Kolmogorov–Smirnov / Lilliefors test, small sample s
Posted:
Jul 23, 2013 10:31 PM


I've been reading up on the Kolmogorov-Smirnov (KS) and Lilliefors (LF) tests. I realize there are other tests, but I'm just trying to understand a subtlety of the KS/LF test from an academic perspective. The test statistic is the maximum absolute difference between two CDFs. In a typical usage scenario, one of the two CDFs is a reference distribution, often a theoretical or hypothesized one, while the other is the empirical CDF (EDF) of a sample. For small samples, the EDF is visibly staircase shaped, with the left end of each step being the closed end of an interval and the right end being the open end. The thresholds for rejection are tabulated for various significance levels and sample sizes. The LF thresholds are generated by Monte Carlo simulation, and they take into account the fact that the test statistic tends to be smaller when the parameters of the reference distribution are estimated from the data sample.
Whew. OK, that's all I know.
Now for the question. Let's call F0(x) the reference CDF and F1(x) the EDF to be tested against F0(x). Let the absolute difference between them be deltaCDF(x). Then the test statistic is the max of deltaCDF(x) over x. For small sample sizes, F1(x) has distinct steps. Many tests and visualizations evaluate a metric only at the data points of the sample. If that is done for the KS/LF tests, then deltaCDF(x) is evaluated only at x-values where the sample contains data. That corresponds to the closed (left) end of each staircase step. However, because F1(x) is flat across a step while F0(x) keeps rising, deltaCDF(x) can increase toward the right end of the step. So it is possible for the true max[deltaCDF(x)] over the whole staircase to exceed a selected threshold without the analyst knowing about it.
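To make the two evaluation conventions concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the reference F0 is taken to be a normal CDF with known parameters, the sample has no ties, and the names normal_cdf and ks_statistics are hypothetical helpers, not taken from any package. It returns the maximum of deltaCDF(x) evaluated only at the data points (closed ends) alongside the maximum that also checks the limit at the open (right) end of each step.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Reference CDF F0 (assumed normal for this sketch)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistics(sample, cdf):
    """Return (d_closed, d_sup) for a sample without ties.

    d_closed: max |F1(x) - F0(x)| evaluated only at the data points,
              i.e. at the closed (left) end of each EDF step.
    d_sup:    the supremum over all x, which also checks the limit at the
              open (right) end of each step.
    """
    xs = sorted(sample)
    n = len(xs)
    d_closed = 0.0
    d_sup = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        upper = i / n - f0        # EDF at x (after the jump) minus F0(x)
        lower = f0 - (i - 1) / n  # F0(x) minus EDF just to the left of x
        d_closed = max(d_closed, abs(upper))
        d_sup = max(d_sup, upper, lower)
    return d_closed, d_sup
```

Calling ks_statistics(sample, normal_cdf) on a small sample returns both values. By construction d_sup is never smaller than d_closed, and any gap between them is the part of the difference that an evaluation at the data points alone would miss.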
Is this actually a problem? I mean, theoretically it seems to be. However, if each tabulated threshold was arrived at by compiling countless simulated cases in which max[deltaCDF(x)] was itself determined only at the x-values in the data sample, then the tables and the analyst would be using the same convention, and the theoretical concern becomes irrelevant in practice.
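One way to see how much the convention could matter is to simulate thresholds under both conventions. The sketch below is an assumption-laden illustration, not the procedure behind any published table: it assumes a normal reference with mean and standard deviation estimated from each simulated sample (the LF setting), and the names both_statistics and simulate_thresholds, the repetition count, and the seed are all hypothetical choices.

```python
import math
import random
import statistics


def normal_cdf(x, mu, sigma):
    """Normal CDF used as the reference F0 in this sketch."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def both_statistics(sample):
    """Return (d_closed, d_sup) against a normal CDF fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # sample standard deviation (n - 1 divisor)
    d_closed = 0.0
    d_sup = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = normal_cdf(x, mu, sigma)
        upper = i / n - f0        # EDF at the data point minus F0
        lower = f0 - (i - 1) / n  # F0 minus EDF just left of the data point
        d_closed = max(d_closed, abs(upper))
        d_sup = max(d_sup, upper, lower)
    return d_closed, d_sup


def simulate_thresholds(n, alpha=0.05, reps=2000, seed=0):
    """Estimate the (1 - alpha) quantile of each statistic under normality."""
    rng = random.Random(seed)
    closed, sup = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        c, s = both_statistics(sample)
        closed.append(c)
        sup.append(s)
    closed.sort()
    sup.sort()
    k = min(reps - 1, int(math.ceil((1.0 - alpha) * reps)) - 1)
    return closed[k], sup[k]
```

Comparing the two values returned by simulate_thresholds(n) for a small n gives a rough sense of how far apart thresholds built under the two conventions would be. Whichever convention a given table actually used, the statistic computed from the data would need to follow the same one.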



