Ask Dr. Math - Questions and Answers from our Archives

Testing a Set of Data for Normal Distribution

Date: 08/02/2008 at 18:27:49
From: Bugs
Subject: How to prove a set of data is under Gaussian Distribution?

I have a set of data which I have obtained through experiments.  I
need to prove that the data belong to a Gaussian distribution.  How
do I do that?  I am not that great at probability, so I am not sure
how it's done.



Date: 08/04/2008 at 20:40:02
From: Doctor Achilles
Subject: Re: How to prove a set of data is under Gaussian Distribution?

Hi Bugs,

Thanks for writing to Dr. Math.

I had to test for normality once and it took me a long, long time to
figure out how.  I ended up finding out a lot of valuable information
in my statistics text: Biostatistical Analysis by J.H. Zar (4th
edition).  I will summarize my findings for you.  I should warn you,
none of the methods for testing normality is easy.

Depending on why you need to do this test, some preliminary
information may be of value.

First, it is very hard to determine normality with small sample sizes.
Depending on how skewed (etc.) your data are, it may simply not be
possible to reach a conclusion either way.  In general, the assumption
is that data are normally distributed unless shown otherwise; however,
for the purposes of statistical tests performed on the data, that
assumption is not necessarily inviolate (see the next paragraph).

Second, you may be trying to determine whether to perform a parametric
statistical test (such as a t-test or ANOVA) on your data or instead
perform a non-parametric test (such as a Wilcoxon test).  If that is
the case, you should know that parametric tests are more powerful than
non-parametric tests.  In other words, a non-parametric test might
miss a statistically significant difference that a parametric test
would find.  Because a non-parametric test errs only on the side of
caution, it is always okay to run one, even on data that are (or
might be) normally distributed.

One common test for normality, with which I am personally NOT
familiar, is the Kolmogorov-Smirnov test.  The math behind it is very
involved, and if you want to learn more about this test I would
suggest you refer to other resources such as this page:

  Wikipedia: Kolmogorov-Smirnov Test
    http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test 
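
If you have Python with SciPy available, a minimal sketch of the idea
might look like this (it compares the sample against a normal curve
whose mean and StDev are estimated from the sample itself; strictly
speaking, estimating the parameters that way makes the standard K-S
p-value only approximate):

  import numpy as np
  from scipy import stats

  data = np.array([1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1])

  # Compare the sample to a normal distribution fitted to the
  # sample's own mean and standard deviation.
  d, p = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
  print(d, p)   # a small p would suggest a deviation from normality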

There are two methods with which I have some familiarity for testing
the normality of a data set.


The first and easiest is the Chi-square test.  The advantage here is
the ease.  The disadvantage is that it is not very powerful.  In other
words, you may be unable to reject the hypothesis that your data are
normally distributed when another, more powerful test would detect a
deviation from normality.  It is, however, the only one of these tests
that you can run on small sample sizes.

Let's use this example data set:

  1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1

To run this test for normality, first calculate the mean and standard
deviation for your data set.

  Mean = 3.3
  StDev = 1.5

Then, put your data into a histogram.

  Bin  |  Observed
 ------------------
  0-1  |    0
  1-2  |    3
  2-3  |    0
  3-4  |    4
  4-5  |    2
  5-6  |    0
  6-7  |    1
  7-8  |    0
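
Both steps are easy to check with NumPy (a sketch; the bins argument
gives the bin edges 0, 1, ..., 8):

  import numpy as np

  data = np.array([1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1])

  print(data.mean())         # 3.3
  print(data.std(ddof=1))    # ~1.5 (sample standard deviation)

  counts, edges = np.histogram(data, bins=np.arange(0, 9))
  print(counts)              # [0 3 0 4 2 0 1 0]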

Next, make an "ideal" histogram based only on the mean and standard
deviation.  In other words, for a perfectly normally distributed data
set with a mean of 3.3 and a standard deviation of 1.5, what part of
the data would we expect to fall into each of the bins?

The function for this is the Gaussian distribution, which is defined as:

  f(x) = a*e^(-(x-m)^2/(2s^2))

Where "e" is the base of natural logarithms

  e = 2.71828...
    http://mathforum.org/dr.math/faq/faq.e.html 

"x" is a given value we might observe, "m" is the mean of our
distribution, "s" is the standard deviation, and "a" is a scaling
factor that makes the total area under the curve equal the size n of
our original data set:

  a = n/(s*sqrt(2*pi))

For s = 1.5, that works out to about 0.266 times n.

Our original data set had 10 items in it, so a = 10/(1.5*sqrt(2*pi))
= 2.66; the mean of our original data set was 3.3, so m = 3.3; and
the StDev of our original data set was 1.5, so s = 1.5.

So our function becomes:

  f(x) = 2.66e^(-(x-3.3)^2/(2*1.5^2))

or:

  f(x) = 2.66e^(-(x-3.3)^2/4.5)

Now we use this to generate a new set of values.  To do this, we take
the integral of the distribution over each range.  So, the integral of
the function from x=0 to x=1 is 0.49.  That means that if we took 10
samples from a normal distribution, we would expect 0.49 occurrences
of a value between 0 and 1.

The integral from 1 to 2 is 1.30.  So we would expect 1.30 occurrences
of a value between 1 and 2 if we took 10 samples.

We can generate a table of the expected number of occurrences for
each bin of our histogram:

   Bin  |  Expected
  ------------------
   0-1  |   0.49
   1-2  |   1.30
   2-3  |   2.28
   3-4  |   2.59
   4-5  |   1.92
   5-6  |   0.93
   6-7  |   0.29
   7-8  |   0.06
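
If you would rather not integrate by hand, note that the integral of
f(x) over a bin is just n times the difference of the normal CDF at
the bin's two edges.  A minimal sketch using SciPy:

  import numpy as np
  from scipy.stats import norm

  data = np.array([1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1])
  m, s, n = data.mean(), data.std(ddof=1), len(data)

  edges = np.arange(0, 9)       # bin edges 0, 1, ..., 8
  cdf = norm.cdf(edges, loc=m, scale=s)
  expected = n * np.diff(cdf)   # expected count in each bin
  print(np.round(expected, 2))  # matches the Expected column above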

Now, we run the Chi-square test.  For more information on how this
test works, check out:

  Chi-Square Test
    http://mathforum.org/library/drmath/view/60432.html 

Essentially, what we do is set up a table of expected measurements and
actual measurements for each bin:

   Bin  |  Expected  |  Observed
  -------------------------------
   0-1  |   0.49     |     0
   1-2  |   1.30     |     3
   2-3  |   2.28     |     0
   3-4  |   2.59     |     4
   4-5  |   1.92     |     2
   5-6  |   0.93     |     0
   6-7  |   0.29     |     1
   7-8  |   0.06     |     0

Then for each row we take (observed - expected)^2 and divide it by
the expected value.  These are the individual chi-square contributions:

   Bin  |  Expected  |  Observed  |  (O-E)^2/E
  ----------------------------------------------
   0-1  |   0.49     |     0      |    0.4900
   1-2  |   1.30     |     3      |    2.2231
   2-3  |   2.28     |     0      |    2.2800
   3-4  |   2.59     |     4      |    0.7676
   4-5  |   1.92     |     2      |    0.0033
   5-6  |   0.93     |     0      |    0.9300
   6-7  |   0.29     |     1      |    1.7383
   7-8  |   0.06     |     0      |    0.0600

We add those all up, and the sum is our chi-square statistic:

  Chi-square = 8.49

The degrees of freedom equal the number of bins, minus 1, minus 1
more for each parameter estimated from the data.  With 8 bins and 2
estimated parameters (the mean and the StDev), that is 8 - 1 - 2 = 5
degrees of freedom.  A chi-square table gives a p-value between 0.10
and 0.25 for 8.49 with 5 degrees of freedom.  In other words, if the
data really were normally distributed, there would be a 10% to 25%
chance of seeing a deviation at least this large by chance alone.

Traditionally, in statistics, you need a p-value of less than 0.05 to
reject the null hypothesis.  In this case, the null hypothesis was
normality.  Because our p-value is greater than 0.05 (indeed, it's
greater than 0.10), we cannot reject the null hypothesis.  Therefore,
we have not shown that this data set deviates from normality.
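
Here is the whole calculation as a short Python sketch (the p-value
comes from SciPy's chi-square survival function; everything else is
just the arithmetic above):

  import numpy as np
  from scipy.stats import chi2

  observed = np.array([0, 3, 0, 4, 2, 0, 1, 0])
  expected = np.array([0.49, 1.30, 2.28, 2.59, 1.92, 0.93, 0.29, 0.06])

  # Chi-square statistic: sum over bins of (O - E)^2 / E.
  chisq = ((observed - expected)**2 / expected).sum()   # ~8.49

  # 8 bins, minus 1, minus the 2 fitted parameters (mean, StDev).
  df = len(observed) - 1 - 2

  p = chi2.sf(chisq, df)   # ~0.13, so we cannot reject normality
  print(chisq, df, p)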

Phew!  Ok, that was the first way to test normality.

You may have noticed in doing this that the size we chose for our
bins was somewhat arbitrary.  What would have happened if I chose bins
of twice that size?  Or of half?


The other test of normality is the most powerful but also the most
math-intensive.  It uses two different measures: skew and kurtosis.
The math requires n > 20, and really you need n > 50 or so to have
any power, so this doesn't work with small sample sizes.

A normal distribution is symmetric about the mean.  Skew is a measure
of how much the bell-curve for your data set is heavy on one side.

A normal distribution also has a specific shape: a fixed relationship
between its width, its height, and the weight of its tails.  However,
you could imagine stretching a bell curve out in weird ways without
changing its symmetry.  You could have a sharp, pointy distribution,
or a fat, boxy one.  The pointy ones have positive "kurtosis" and the
boxy ones have negative "kurtosis".  A good statistics program should
be able to calculate skew and kurtosis for you.
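
For instance, with SciPy (a sketch; the fisher flag picks between the
two kurtosis conventions discussed further below):

  import numpy as np
  from scipy.stats import skew, kurtosis

  data = np.array([1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1])

  print(skew(data))                    # 0 for a perfectly symmetric sample
  print(kurtosis(data, fisher=True))   # "excess" kurtosis: 0 for a normal
  print(kurtosis(data, fisher=False))  # raw kurtosis: 3 for a normal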

If your data set is larger than 20, you can try testing for normality
using the D'Agostino-Pearson test.  The basic idea is to normalize the
measure of the kurtosis and the skewness to a common value (based on
the sample size) and then add those normalized values together.  This
can then be tested for significant deviations from normality.  

You can read more about the D'Agostino-Pearson test and get a table
that can be used in Excel here:

  Wikipedia: Normality Test
    http://en.wikipedia.org/wiki/User:Xargque#Normality_Test 
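
If you have SciPy, its normaltest function implements this
D'Agostino-Pearson omnibus test.  A sketch (with stand-in data drawn
from an actual normal distribution; your own measurements would go in
its place):

  import numpy as np
  from scipy.stats import normaltest

  data = np.random.normal(loc=3.3, scale=1.5, size=4000)

  k2, p = normaltest(data)   # combines normalized skew and kurtosis
  print(k2, p)               # p < 0.05 would mean "reject normality"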


Finally, another test that is related to the D'Agostino-Pearson test
but is a little simpler is the Jarque-Bera test.  It seems a little
more common and straightforward.  Details can be found here:

  Wikipedia: Jarque-Bera Test
    http://en.wikipedia.org/wiki/Jarque-Bera_test 
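
SciPy provides this one directly as well (again a sketch with
stand-in data):

  import numpy as np
  from scipy.stats import jarque_bera

  data = np.random.normal(loc=3.3, scale=1.5, size=4000)

  jb, p = jarque_bera(data)   # based on sample skew and kurtosis
  print(jb, p)                # p < 0.05 would mean "reject normality"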

One item of note: depending on how your stats program calculates
kurtosis, you may or may not need to subtract 3 from kurtosis.  See:

  Wikipedia Talk: Jarque-Bera Test
    http://en.wikipedia.org/wiki/Talk:Jarque-Bera_test 

The raw kurtosis (the fourth standardized moment) of a normal
distribution is 3; the "excess" kurtosis subtracts 3 from that, so a
normal distribution has excess kurtosis 0.  The D'Agostino-Pearson
test as described above assumes the excess-kurtosis convention, but
some stats programs report raw kurtosis instead.  You should figure
out which way your stats program calculates kurtosis.


I hope this has been helpful.  If you want to talk about this some
more or if you still are having trouble figuring out if your data set
is normally distributed, let me know.

- Doctor Achilles, The Math Forum
  http://mathforum.org/dr.math/ 




Date: 08/06/2008 at 10:17:58
From: Bugs
Subject: Thank you (How to prove a set of data is under Gaussian
Distribution?)

Dear Doctor Achilles,

You are just amazing.  Your explanations are so clear.  I am so
thankful to you.  I actually have a large data set, around 4000
samples for each different case.  In some cases the bell curve is
skewed; in some it's not.  Overall I need to prove that the
distribution is Gaussian.  I am planning to use the
D'Agostino-Pearson test after reading your mail.  I will also try the
other tests you mentioned.

Thank you so much for all the trouble.

Cheers,

Bugs
Associated Topics:
College Statistics