Topic: Kolmogorov–Smirnov / Lilliefors test, small samples

Replies: 5   Last Post: Aug 11, 2013 5:23 PM

 David Jones Posts: 80 Registered: 2/9/12
Re: Kolmogorov–Smirnov / Lilliefors test, small samples

Posted: Jul 24, 2013 8:46 AM

-----Original Message-----
From: andymhancock@gmail.com
Sent: Wednesday, July 24, 2013 3:31 AM
Newsgroups: sci.stat.math
Subject: Kolmogorov–Smirnov / Lilliefors test, small samples

I've been reading up on Kolmogorov-Smirnov (KS) and Lilliefors (LF) tests.
I realize there are other tests, but I'm just trying to understand a subtlety
of the KS/LF test from an academic perspective. The test statistic is the
maximum difference in the CDFs, and in a typical usage scenario, one of the
two CDFs being compared is a reference distribution, often a theoretical
and/or hypothesized distribution, while the other CDF is an empirical CDF
from a sample (EDF). For small samples, the EDF is staircase shaped, with
the left end of each step being the closed end of an interval and the right
end being the open end. The thresholds for rejection are tabulated for
various significance levels and sample sizes. The LF thresholds are generated
from Monte Carlo simulation, and they take into account the fact that the
test statistic is smaller when the parameters of the reference distribution
are estimated from the data sample.
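
The Monte Carlo idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any published LF table was actually produced: the reference distribution is taken to be normal with mean and standard deviation estimated from each simulated sample, and the function names, the number of replications and the seed are all made up for the example.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def lilliefors_stat(sample):
    """Max EDF/CDF gap against a normal fitted to the sample itself."""
    xs = sorted(sample)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(xs):
        f0 = normal_cdf(x, mu, sigma)
        # Compare F0 with the EDF at x ((i+1)/n) and just to the left of x (i/n).
        d = max(d, (i + 1) / n - f0, f0 - i / n)
    return d


def simulate_critical_value(n, alpha=0.05, reps=5000, seed=0):
    """Empirical (1 - alpha) quantile of the statistic under normal data."""
    rng = random.Random(seed)
    stats = sorted(
        lilliefors_stat([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    index = min(reps - 1, math.ceil((1.0 - alpha) * reps) - 1)
    return stats[index]


if __name__ == "__main__":
    print(simulate_critical_value(10))
```

For n = 10 at the 5% level, the simulated threshold should come out well below the classical KS value of about 0.409 for a fully specified reference distribution, which is the "statistic is smaller when parameters are estimated" effect mentioned above.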

Whew. OK, that's all I know.

Now for the question. Let's call F0(x) the reference CDF and F1(x) the EDF
to be tested against F0(x). Let the difference be deltaCDF(x). Then the
test statistic is the max of deltaCDF(x) over x. For small sample sizes,
F1(x) has distinct steps. Many tests and visualizations evaluate a metric
only at the data points of the sample. If that is done for the KS/LF tests,
then deltaCDF(x) is evaluated only at x-values where the sample contains
data. That would correspond to the closed end (left end) of each staircase
step. However, it is possible for deltaCDF(x) to increase toward the right
end of each staircase step. So it is possible for the test statistic
max[deltaCDF(x)] to exceed a selected threshold without the analyst knowing.

Is this actually a problem? I mean, theoretically it seems to be. However,
if each tabulated threshold is arrived at by compiling countless cases in
which max[deltaCDF(x)] is determined only at x-values in the data sample,
then the theory becomes irrelevant.

========================================================================

Well, it is and it isn't a problem. All relevant theoretical and practical
works take care of the problem by carefully defining the test statistic
being used so that the problem does not arise. Notionally this just involves
assessing both F1(x) and F1(x-), the value of the EDF just to the left of x,
at each observed data point in comparison to F0(x), but it is often
expressed in more computationally relevant terms.
See, for example:

Biometrika Tables for Statisticians, Volume 2, p118

Empirical Processes with Applications to Statistics, by Shorack & Wellner
(Wiley).
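
As a rough illustration of what "assessing both F1(x) and F1(x-)" means in computational terms, here is a minimal sketch. It is not taken from either reference above; the function names and the toy sample are assumptions made for the example, and the reference distribution is taken to be uniform on [0, 1].

```python
def ks_statistic(sample, cdf):
    """Sup-norm distance between the EDF of sample and a reference cdf.

    At each sorted data point x_i the EDF jumps from i/n (its left limit,
    F1(x-)) to (i+1)/n (its value, F1(x)); both are compared with F0(x).
    """
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))   # F1(x) - F0(x)
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))        # F0(x) - F1(x-)
    return max(d_plus, d_minus)


def closed_end_only(sample, cdf):
    """Gap evaluated only at the closed (left) end of each step."""
    xs = sorted(sample)
    n = len(xs)
    return max(abs((i + 1) / n - cdf(x)) for i, x in enumerate(xs))


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


if __name__ == "__main__":
    sample = [0.6, 0.7, 0.8, 0.9]                  # toy data, hypothetical
    print(closed_end_only(sample, uniform_cdf))   # about 0.35
    print(ks_statistic(sample, uniform_cdf))      # 0.6
```

In this toy case the largest gap sits just to the left of the first data point, where the EDF is still 0 while F0 has already reached 0.6, so looking only at the closed ends understates the statistic.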

The point is an important one since, unless dealt with properly, one could
end up with different results according to whether or not one chooses to
multiply all data (and change the modelled distribution accordingly) by -1.
If you look further, you will see that related problems in defining test
statistics for multivariate distributions arise with respect to the ordering
of the data and the orientation of the various axes.
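
The sign-flip point can be checked numerically with a small sketch. The toy sample, the uniform reference distribution and the function names are assumptions made for the illustration only. If X is uniform on [0, 1] with CDF F0, then -X has CDF G(y) = 1 - F0(-y).

```python
def two_sided_stat(sample, cdf):
    """Statistic using both F1(x) and its left limit F1(x-) at each point."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs)
    )


def closed_end_stat(sample, cdf):
    """Statistic using only F1(x) at each data point."""
    xs = sorted(sample)
    n = len(xs)
    return max(abs((i + 1) / n - cdf(x)) for i, x in enumerate(xs))


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def negated_uniform_cdf(y):
    """CDF of -X when X is uniform on [0, 1]: G(y) = 1 - F0(-y)."""
    return 1.0 - uniform_cdf(-y)


if __name__ == "__main__":
    sample = [0.6, 0.7, 0.8, 0.9]            # toy data, hypothetical
    flipped = [-x for x in sample]
    # Two-sided definition: same value before and after the flip.
    print(two_sided_stat(sample, uniform_cdf),
          two_sided_stat(flipped, negated_uniform_cdf))
    # Closed-end-only definition: the value changes under the flip.
    print(closed_end_stat(sample, uniform_cdf),
          closed_end_stat(flipped, negated_uniform_cdf))
```

With this sample the two-sided statistic is 0.6 either way, while the closed-end-only version gives about 0.35 for the original data and about 0.6 after negation, which is the kind of inconsistency described above.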

David Jones
