On Mar 23, 9:16 pm, Steven D'Aprano <steve +comp.lang.pyt...@pearwood.info> wrote:
> I'm trying to demonstrate numerically (rather than algebraically) that
> the expectation of the sample variance is the population variance, but
> it's not working for me.
>
> Some quick(?) background... please correct me if I'm wrong about anything.
>
> The variance of a population is:
>
> σ^2 = 1/n * Σ(x-μ)^2 over all x in the population
>
> where ^2 means superscript 2 (i.e. squared). In case you can't read the
> symbols, here it is again in ASCII-only text:
>
> theta^2 = 1/n * SUM( (x-mu)^2 )
>
> If you don't have the entire population as your data, you can estimate
> the population variance by calculating a sample variance:
>
> s'^2 = 1/n * Σ(x-μ)^2 over all x in the sample
>
> where s' is being used instead of s subscript n.
>
> This is unbiased, provided you know the population mean mu μ. Normally
> you don't though, and you're reduced to estimating it from your sample:
>
> s'^2 = 1/n * Σ(x-m)^2
>
> where m is being used as the symbol for sample mean x bar = Σx/n
>
> Unfortunately this sample variance is biased, so the "unbiased sample
> variance" is used instead:
>
> s^2 = 1/(n-1) * Σ(x-m)^2
>
> What makes this unbiased is that the expected value of the sample
> variances equals the true population variance. E.g. see
>
> http://en.wikipedia.org/wiki/Bessel's_correction
>
> The algebra convinces me -- I'm sure it's correct. But I'd like an easy
> example I can show people, but it's not working for me!
>
> Let's start with a population of: [1, 2, 3, 4]. The true mean is 2.5 and
> the true (population) variance is 1.25.
>
> All possible samples for each sample size > 1, and their exact sample
> variances, are:
>
> n = 2
> 1,2 : 1/2
> 1,3 : 2
> 1,4 : 9/2
> 2,3 : 1/2
> 2,4 : 2
> 3,4 : 1/2
> Expectation for n=2: 5/3
>
> n=3
> 1,2,3 : 1
> 1,3,4 : 7/3
> 2,3,4 : 1
> Expectation for n=3: 13/9
>
> n=4
> 1,2,3,4 : 5/3
> Expectation for n=4: 5/3
>
> As you can see, none of the expectations for a particular sample size are
> equal to the population variance. If I instead add up all ten possible
> sample variances, and divide by ten, I get 1.6 which is still not equal
> to 1.25.
>
> What am I misunderstanding?

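The formulae do not require a Gaussian population: s^2 with the 1/(n-1) factor is unbiased for any population with finite variance. What they do assume is that the sample values are independent draws, i.e. sampling with replacement (or from an effectively infinite population). Your enumeration only lists samples of distinct elements, which is sampling without replacement from a finite population of N = 4. In that case the expectation is

E[s^2] = N/(N-1) * sigma^2 = 4/3 * 1.25 = 5/3

which is exactly what you got for n=2 and n=4. Your n=3 list is missing the sample 1,2,4 (variance 7/3); with it included, the expectation for n=3 is also 5/3 rather than 13/9. If you instead enumerate all ordered samples with replacement (16 of them for n=2, including 1,1 and both 1,2 and 2,1), the s^2 values average to exactly 1.25.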
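
For anyone who wants to check the numbers for the population [1, 2, 3, 4] by brute force, here is a minimal sketch that enumerates every sample exactly, using only the standard library. The helper names sample_variance and mean are my own, not something from the thread, and I am assuming that "with replacement" means all ordered tuples (itertools.product) and "without replacement" means all subsets of distinct elements (itertools.combinations), each equally likely.

```python
from fractions import Fraction
from itertools import combinations, product


def sample_variance(xs):
    """Sample variance with Bessel's correction (divide by n - 1)."""
    n = len(xs)
    m = Fraction(sum(xs), n)  # exact sample mean
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def mean(values):
    """Exact arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


population = [1, 2, 3, 4]
mu = Fraction(sum(population), len(population))
# Population variance: divide by N, using the true mean.
pop_var = sum((x - mu) ** 2 for x in population) / len(population)
print("population variance:", pop_var)

results = {}
for n in (2, 3, 4):
    # Every ordered sample of size n, elements may repeat.
    with_repl = mean(sample_variance(s) for s in product(population, repeat=n))
    # Every subset of n distinct elements, as in the quoted enumeration.
    without_repl = mean(sample_variance(s) for s in combinations(population, n))
    results[n] = (with_repl, without_repl)
    print(n, with_repl, without_repl)
```

Because everything is kept as Fraction, the comparison is exact rather than subject to float rounding. The with-replacement column comes out as 5/4 for every n, and the without-replacement column as 5/3.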