Date: Aug 31, 2011 2:02 PM
Author: Steven D'Aprano
Subject: Re: Which sample variance should I choose?

Paige Miller wrote:

> On Aug 31, 1:06 am, Steven D'Aprano <steve
> +comp.lang.pyt...@pearwood.info> wrote:

[...]
>> Under what circumstances should I prefer each of these four estimators of
>> σ^2 and what are the pros and cons of each?

>
> There is no answer until you tell us what you are planning to use the
> variance for.


That's exactly what I'm trying to find out. Under which circumstances should
I prefer one method over the others?

I don't actually have a *specific* usage in mind, other than answering the
question "what's the sample variance of this data?" But I would like to
understand why somebody might choose one version or another.

E.g.

the unbiased sample variance (divide by n-1) has the advantage that, on
average, it will equal the population variance (provided certain
assumptions hold, such as independent draws, e.g. sampling with
replacement);

but the unbiased sample variance also has a larger spread, so although it is
the most accurate on average, there's a greater chance that any one result
will be far off. The biased sample variance (divide by n) is less accurate
but more precise (the results are clustered more closely together, so the
chance of getting a result that is *way* off is reduced);

etc. Or at least, this is what I *think* is the case.
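
One way to check the accuracy-versus-precision claim is a small simulation. The sketch below is only illustrative: the normal population (mean 0, standard deviation 1), the sample size of 5, the number of trials, and the function names are all assumptions chosen for the example, not anything from the original discussion. It computes both estimators on the same samples and reports the average and the spread of each.

```python
import random
import statistics

def var_n(sample):
    """Sum of squares about the sample mean, divided by n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def var_n_minus_1(sample):
    """Sum of squares about the sample mean, divided by n-1."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

def simulate(n=5, trials=20000, mu=0.0, sigma=1.0, seed=12345):
    """Draw `trials` samples of size n from an assumed normal population
    and return the two lists of variance estimates."""
    rng = random.Random(seed)
    biased, unbiased = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        biased.append(var_n(sample))
        unbiased.append(var_n_minus_1(sample))
    return biased, unbiased

biased, unbiased = simulate()
for name, estimates in (("divide by n", biased), ("divide by n-1", unbiased)):
    # Average shows the bias; standard deviation shows the spread.
    print(name, statistics.mean(estimates), statistics.stdev(estimates))
```

Because the n version is just the n-1 version scaled by (n-1)/n, its spread is smaller by exactly that factor, while its average falls short of the true variance by the same factor.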

I'm not even sure that it is mathematically valid to substitute µ into the
sample variance formulae instead of the sample mean. I can't see why it
wouldn't be, but I'm not sure.
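
On the substitution question, the expectations can be worked out directly. This is a sketch under the assumption that the x_i are independent draws from a population with mean µ and finite variance σ^2, with m the sample mean; it uses the identity Σ(x_i - m)^2 = Σ(x_i - µ)^2 - n(m - µ)^2 and the fact that E[(m - µ)^2] = σ^2/n.

```latex
\begin{align*}
E\left[\sum_{i=1}^{n} (x_i - \mu)^2\right]
  &= \sum_{i=1}^{n} E\left[(x_i - \mu)^2\right] = n\sigma^2, \\
E\left[\sum_{i=1}^{n} (x_i - m)^2\right]
  &= E\left[\sum_{i=1}^{n} (x_i - \mu)^2\right] - n\,E\left[(m - \mu)^2\right]
   = n\sigma^2 - \sigma^2 = (n-1)\sigma^2.
\end{align*}
```

So, under those assumptions, substituting µ is legitimate whenever µ is genuinely known, and the sum about µ is matched by a divisor of n, while the sum about the sample mean m is matched by a divisor of n-1.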

For reference, here are the suggested sample variance formulae again:

s^2 = Σ(x - m)^2 / n      (Eq. 1) Biased, using sample mean
s^2 = Σ(x - m)^2 / (n-1)  (Eq. 2) Unbiased, using sample mean
s^2 = Σ(x - µ)^2 / n      (Eq. 3) Unbiased, using population mean
s^2 = Σ(x - µ)^2 / (n-1)  (Eq. 4) Biased (too large on average), using population mean

where the sums are over each x in the sample.
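
The four formulae differ only in the centre used (m or µ) and in the divisor (n or n-1), so they are easy to write side by side. The following is a minimal sketch, not a definitive implementation: the function names eq1 to eq4, the helper sum_sq, and the simulated population (normal, µ = 10, σ = 2, so σ^2 = 4) are all assumptions made up for the illustration. It averages each formula over many samples of size 5 so the averages can be compared with the true σ^2.

```python
import random

def sum_sq(sample, centre):
    """Sum of squared deviations of the sample about `centre`."""
    return sum((x - centre) ** 2 for x in sample)

def eq1(sample):
    """Eq. 1: about the sample mean, divide by n."""
    m = sum(sample) / len(sample)
    return sum_sq(sample, m) / len(sample)

def eq2(sample):
    """Eq. 2: about the sample mean, divide by n-1."""
    m = sum(sample) / len(sample)
    return sum_sq(sample, m) / (len(sample) - 1)

def eq3(sample, mu):
    """Eq. 3: about a known population mean, divide by n."""
    return sum_sq(sample, mu) / len(sample)

def eq4(sample, mu):
    """Eq. 4: about a known population mean, divide by n-1."""
    return sum_sq(sample, mu) / (len(sample) - 1)

def average_estimates(n=5, trials=20000, mu=10.0, sigma=2.0, seed=2011):
    """Average each formula over many samples from an assumed normal population."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0, 0.0]
    for _ in range(trials):
        s = [rng.gauss(mu, sigma) for _ in range(n)]
        for i, value in enumerate((eq1(s), eq2(s), eq3(s, mu), eq4(s, mu))):
            totals[i] += value
    return [t / trials for t in totals]

averages = average_estimates()
for label, avg in zip(("Eq. 1", "Eq. 2", "Eq. 3", "Eq. 4"), averages):
    print(label, round(avg, 3))
```

Under these assumptions the averages for Eq. 2 and Eq. 3 should land close to 4, with Eq. 1 falling below it and Eq. 4 above it.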


--
Steven