Testing a Set of Data for Normal Distribution
Date: 08/02/2008 at 18:27:49
From: Bugs
Subject: How to prove a set of data is under Gaussian Distribution?

I have a set of data which I have obtained through experiments. I need to prove that the data belong to a Gaussian distribution. How do I do that? I am not that great at probability, so I am not sure how it's done.
Date: 08/04/2008 at 20:40:02
From: Doctor Achilles
Subject: Re: How to prove a set of data is under Gaussian Distribution?

Hi Bugs,

Thanks for writing to Dr. Math.

I had to test for normality once and it took me a long, long time to figure out how. I ended up finding a lot of valuable information in my statistics text: Biostatistical Analysis by J.H. Zar (4th edition). I will summarize my findings for you. I should warn you, none of the methods for testing normality are easy.

Depending on why you need to do this test, some preliminary information may be of value.

First, it is very hard to determine normality with small sample sizes. Depending on how skewed, etc. your data are, it may just not be possible to conclude either way. In general, the assumption is that data are normally distributed unless concluded otherwise. However, for the purposes of statistical tests performed on the data, that assumption is not necessarily inviolate (see next paragraph).

Second, you may be trying to determine whether to perform a parametric statistical test (such as a t-test or ANOVA) on your data or instead perform a non-parametric test (such as a Wilcoxon test). If that is the case, you should know that parametric tests are more powerful than non-parametric tests when their assumptions hold. In other words, non-parametric tests might miss a statistically significant difference that a parametric test would find. Because a non-parametric test does not assume normality, it is always okay to run one (even on data that are normally distributed or on data that might be normally distributed); the only cost is the lost power.

One common test for normality, with which I am personally NOT familiar, is the Kolmogorov-Smirnov test. The math behind it is very involved, and I would suggest you refer to other resources such as this page

  Wikipedia: Kolmogorov-Smirnov Test
    http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

if you want to learn more about this test.

There are 2 methods that I have some familiarity with for testing the normality of a data set.
The first and easiest is the Chi-square test. The advantage here is the ease. The disadvantage is that it is not very powerful. In other words, you may be unable to reject the hypothesis that your data are normally distributed when another, more powerful test would detect a deviation from normality. Of the tests I describe here, it is also the only one that you can run on small sample sizes.

Let's use this example data set:

  1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1

To run this test for normality, first calculate the mean and standard deviation for your data set.

  Mean = 3.3
  StDev = 1.5

Then, put your data into a histogram.

  Bin  |  Observed
  ------------------
  0-1  |  0
  1-2  |  3
  2-3  |  0
  3-4  |  4
  4-5  |  2
  5-6  |  0
  6-7  |  1
  7-8  |  0

Next, make an "ideal" histogram based only on the mean and standard deviation. In other words, for a perfectly normally distributed data set with a mean of 3.3 and a standard deviation of 1.5, what part of the data would we expect to fall into each of the bins?

The function for this is the Gaussian Distribution, which is defined as:

  f(x) = a*e^(-(x-m)^2/(2s^2))

Where "e" is the base of natural logarithms

  e = 2.71828...
    http://mathforum.org/dr.math/faq/faq.e.html

"x" is a given value we might observe, "m" is the mean of our distribution, "s" is the standard deviation, and "a" is a scaling factor equal to n/(s*sqrt(2*pi)), where "n" is the size of our original data set. For s = 1.5, 1/(s*sqrt(2*pi)) = 0.266.

Our original data set had 10 items in it, so a = 0.266*10 = 2.66, the mean of our original data set was 3.3, so m = 3.3, and the StDev of our original data set was 1.5, so s = 1.5. So our function becomes:

  f(x) = 2.66e^(-(x-3.3)^2/(2*1.5^2))

or:

  f(x) = 2.66e^(-(x-3.3)^2/4.5)

Now we use this to generate a new set of values. To do this, we take the integral of the distribution over each range. So, the integral of the function from x=0 to x=1 is 0.49. That means that if we took 10 samples from a normal distribution, we would expect 0.49 occurrences of a value between 0 and 1. The integral from 1 to 2 is 1.30.
So we would expect 1.30 occurrences of a value between 1 and 2 if we took 10 samples.

We can generate a table of the expected number of occurrences of each bin from our histogram:

  Bin  |  Expected
  ------------------
  0-1  |  0.49
  1-2  |  1.30
  2-3  |  2.28
  3-4  |  2.59
  4-5  |  1.92
  5-6  |  0.93
  6-7  |  0.29
  7-8  |  0.06

Now, we run the Chi-square test. For more information on how this test works, check out:

  Chi-Square Test
    http://mathforum.org/library/drmath/view/60432.html

Essentially, what we do is set up a table of expected measurements and actual measurements for each bin:

  Bin  |  Expected  |  Observed
  -------------------------------
  0-1  |  0.49      |  0
  1-2  |  1.30      |  3
  2-3  |  2.28      |  0
  3-4  |  2.59      |  4
  4-5  |  1.92      |  2
  5-6  |  0.93      |  0
  6-7  |  0.29      |  1
  7-8  |  0.06      |  0

Then we take (observed - expected)^2 / expected for each row. This is that row's contribution to the chi-square value:

  Bin  |  Expected  |  Observed  |  Chi-square
  ----------------------------------------------
  0-1  |  0.49      |  0         |  0.4900
  1-2  |  1.30      |  3         |  2.2231
  2-3  |  2.28      |  0         |  2.2800
  3-4  |  2.59      |  4         |  0.7676
  4-5  |  1.92      |  2         |  0.0033
  5-6  |  0.93      |  0         |  0.9300
  6-7  |  0.29      |  1         |  1.7383
  7-8  |  0.06      |  0         |  0.0600

We add those all up and that gives us our chi-square statistic. The sum is 8.4923.

The degrees of freedom come from the number of bins, not the number of samples. We have 8 bins, and we estimated two parameters (the mean and the standard deviation) from the data, so we have 8 - 1 - 2 = 5 degrees of freedom. With 5 degrees of freedom, a chi-square value of 8.49 gives a p-value between 0.25 and 0.10. In other words, if the data really were drawn from a normal distribution, we would see a chi-square value at least this large between 10% and 25% of the time.

Traditionally, in statistics, you need a p-value of less than 0.05 to reject the null hypothesis. In this case, the null hypothesis was normality. Because our p-value is greater than 0.05 (actually, it's greater than 0.10), we cannot reject the null hypothesis. Therefore, we have not proven that this data set is different from normality (which is not the same as proving that it is normal).

Phew! Ok, that was the first way to test normality. You may have noticed in doing this that the size we chose for our bins was somewhat arbitrary. What would have happened if I chose bins of twice that size? Or of half?
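The chi-square procedure above can be sketched in a few lines of Python. This is only an illustration, assuming Python 3.8 or later (for statistics.NormalDist); the function name chi_square_normality and the variable names are made up for this example. It uses the unrounded mean and standard deviation, so the expected counts can differ slightly from the rounded table values, and it takes the 5% critical value for 5 degrees of freedom (11.070) from a standard chi-square table rather than computing a p-value.

```python
from statistics import NormalDist, mean, stdev

# Example data set and bin edges (0-1, 1-2, ..., 7-8) from the worked example.
data = [1.2, 1.4, 1.9, 3.1, 3.3, 3.6, 3.8, 4.2, 4.4, 6.1]
edges = list(range(0, 9))


def chi_square_normality(data, edges):
    """Return (observed, expected, statistic, dof) for a binned chi-square test.

    Hypothetical helper for illustration: the expected counts come from a
    normal distribution fitted with the sample mean and sample standard
    deviation of the data.
    """
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))

    observed = []
    expected = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Observed count: how many data points fall in [lo, hi).
        observed.append(sum(lo <= x < hi for x in data))
        # Expected count: n times the normal probability of the bin,
        # i.e. the integral of the scaled Gaussian over the bin.
        expected.append(n * (fitted.cdf(hi) - fitted.cdf(lo)))

    # Chi-square statistic: sum of (observed - expected)^2 / expected.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Degrees of freedom: bins - 1, minus 2 for the estimated mean and StDev.
    dof = len(observed) - 1 - 2
    return observed, expected, statistic, dof


observed, expected, statistic, dof = chi_square_normality(data, edges)

# 5% critical value for 5 degrees of freedom, from a standard table.
CRITICAL_5PCT_DOF5 = 11.070

print("observed:", observed)
print("expected:", [round(e, 2) for e in expected])
print("chi-square:", round(statistic, 2), "with", dof, "degrees of freedom")
print("reject normality at 5%?", statistic > CRITICAL_5PCT_DOF5)
```

Changing edges (for example to bins of width 2) is an easy way to see how much the result depends on the choice of bins.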
The other test of normality is the most powerful but also the most math intensive. It uses two different parameters: skew and kurtosis. The math requires n > 20, and really you need n > 50 or so to have any power, so this doesn't work with small sample sizes.

A normal distribution is symmetric about the mean. Skew is a measure of how much the bell-curve for your data set is heavy on one side.

A normal distribution also has a specific width for a given height. For a fixed area under the curve, making it taller makes it proportionally narrower. However, you could imagine stretching a bell curve out in weird ways without changing its symmetry. You could have a sharp, pointy distribution, or a fat, boxy one. Measured relative to a normal distribution, the pointy ones have positive "kurtosis" and the boxy ones have negative "kurtosis". A good statistics program should be able to calculate kurtosis for you.

If your data set is larger than 20, you can try testing for normality using the D'Agostino-Pearson test. The basic idea is to normalize the measure of the kurtosis and the skewness to a common value (based on the sample size) and then add those normalized values together. This can then be tested for significant deviations from normality.

You can read more about the D'Agostino-Pearson test and get a table that can be used in Excel here:

  Wikipedia: Normality Test
    http://en.wikipedia.org/wiki/User:Xargque#Normality_Test

Finally, another test that is related to the D'Agostino-Pearson test but is a little simpler is the Jarque-Bera Test. It seems a little more common and straightforward. Details can be found here:

  Wikipedia: Jarque-Bera Test
    http://en.wikipedia.org/wiki/Jarque-Bera_test

One item of note: depending on how your stats program calculates kurtosis, you may or may not need to subtract 3 from kurtosis.
See:

  Wikipedia Talk: Jarque-Bera Test
    http://en.wikipedia.org/wiki/Talk:Jarque-Bera_test

The D'Agostino-Pearson test assumes that the kurtosis of a normal distribution is 0, but some stats programs (for reasons that mystify me) have the kurtosis of a normal distribution set to 3. (The version that equals 0 for a normal distribution is often called "excess kurtosis".) You should figure out which way your stats program calculates kurtosis.

I hope this has been helpful. If you want to talk about this some more or if you still are having trouble figuring out if your data set is normally distributed, let me know.

- Doctor Achilles, The Math Forum
  http://mathforum.org/dr.math/
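The skewness and kurtosis ideas in the reply above can be illustrated with a short Python sketch of the Jarque-Bera statistic. This is not the D'Agostino-Pearson test itself, which normalizes skewness and kurtosis differently; it is the simpler relative mentioned in the reply. The function names are invented for this example, the moments are the plain (biased) sample moments, and the p-value uses the large-sample chi-square approximation with 2 degrees of freedom, so it should only be trusted for large data sets.

```python
import math
import random


def skew_and_excess_kurtosis(data):
    """Return (skewness, excess kurtosis) from biased sample moments.

    Excess kurtosis is the raw kurtosis minus 3, so a normal
    distribution scores 0 on both measures.
    """
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


def jarque_bera(data):
    """Return (JB statistic, approximate p-value).

    JB = n/6 * (skew^2 + excess_kurtosis^2 / 4). For large n and normal
    data, JB is approximately chi-square with 2 degrees of freedom, whose
    upper-tail probability is exp(-JB / 2).
    """
    n = len(data)
    skew, excess = skew_and_excess_kurtosis(data)
    jb = n / 6.0 * (skew ** 2 + excess ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Made-up sample for illustration: 4000 draws from a normal distribution.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(4000)]

skew, excess = skew_and_excess_kurtosis(sample)
jb, p_value = jarque_bera(sample)
print("skewness:", round(skew, 4))
print("excess kurtosis:", round(excess, 4))
print("Jarque-Bera:", round(jb, 4), "p-value:", round(p_value, 4))
```

A p-value below 0.05 would lead you to reject normality. If your stats program reports kurtosis as 3 for a normal distribution, subtract 3 before using it in the JB formula, as this sketch does.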
Date: 08/06/2008 at 10:17:58
From: Bugs
Subject: Thank you (How to prove a set of data is under Gaussian Distribution?)

Dear Doctor Achilles,

You are just amazing. Your explanations are so clear. I am so thankful to you. I actually have a large data set, around 4000 samples for each different case. In some cases the bell curve is skewed; in others it's not. Overall I need to prove that the distribution is Gaussian. I am planning to use the D'Agostino-Pearson test after reading your mail. I will also try the other tests you mentioned.

Thank you so much for all the trouble.

Cheers,
Bugs
© 1994- The Math Forum at NCTM. All rights reserved.