Date: Aug 31, 2011 2:02 PM
Author: Steven D'Aprano
Subject: Re: Which sample variance should I choose?
Paige Miller wrote:

> On Aug 31, 1:06 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:

[...]

>> Under what circumstances should I prefer each of these four estimators of
>> σ^2 and what are the pros and cons of each?
>
> There is no answer until you tell us what you are planning to use the
> variance for.

That's exactly what I'm trying to find out. Under which circumstances should
I prefer one method over the others?

I don't actually have a *specific* usage in mind, other than answering the
question "what's the sample variance of this data?" But I would like to
understand why somebody might choose one version or another.

E.g. the unbiased sample variance (divide by n-1) has the advantage that, on
average, it will equal the population variance (provided certain assumptions
hold, such as sampling with replacement); but it also has a larger spread, so
although it is the most accurate on average, there's a chance that any one
result will be much further off. The biased sample variance (divide by n) is
less accurate but more precise (the results are clustered more closely
together, so the chance of getting a result that is *way* off is much
reduced); etc. Or at least, this is what I *think* is the case.
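One way to probe that trade-off is a quick simulation. This is only a sketch: the normal population (mean 0, standard deviation 1), the sample size of 5, the number of trials and the helper names var_n, var_n1 and simulate are all my own choices for illustration, not anything from the thread.

```python
import random
import statistics


def var_n(xs):
    """Sample variance about the sample mean, dividing by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def var_n1(xs):
    """Sample variance about the sample mean, dividing by n-1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate(estimator, n=5, trials=20000, mu=0.0, sigma=1.0, seed=1):
    """Return (mean, standard deviation) of the estimator over many samples.

    Assumes independent draws from a normal population; the fixed seed
    means both estimators are fed exactly the same samples.
    """
    rng = random.Random(seed)
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.mean(estimates), statistics.pstdev(estimates)


mean_n, spread_n = simulate(var_n)
mean_n1, spread_n1 = simulate(var_n1)
print("divide by n:   mean %.3f, spread %.3f" % (mean_n, spread_n))
print("divide by n-1: mean %.3f, spread %.3f" % (mean_n1, spread_n1))
```

With a true variance of 1, the divide-by-(n-1) results should average close to 1 and the divide-by-n results close to (n-1)/n = 0.8, while the divide-by-n results show the smaller spread. Whether that trade is worth making presumably depends on what the variance is going to be used for.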

I'm not even sure that it is mathematically valid to substitute µ into the
sample variance formulae instead of the sample mean. I can't see why it
wouldn't be, but I'm not sure.
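Trying to work it through on paper, and assuming each x_i is drawn from a population with mean µ and variance σ^2 (so that E[(x_i - µ)^2] = σ^2 by definition), I get the following. Treat it as a sketch rather than a rigorous proof:

```latex
\begin{align*}
E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\right]
  &= \frac{1}{n}\sum_{i=1}^{n}E\left[(x_i-\mu)^2\right] = \sigma^2 \\
E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2\right]
  &= \frac{n}{n-1}\,\sigma^2
\end{align*}
```

If that is right, substituting µ is legitimate, but once µ is known it is the division by n that averages out to σ^2, and dividing by n-1 overshoots by a factor of n/(n-1).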

For reference, here are the suggested sample variance formulae again:

s^2 = Σ(x - m)^2 / n      (Eq. 1)  Biased, using sample mean
s^2 = Σ(x - m)^2 / (n-1)  (Eq. 2)  Unbiased, using sample mean

s^2 = Σ(x - µ)^2 / n      (Eq. 3)  Unbiased, using population mean
s^2 = Σ(x - µ)^2 / (n-1)  (Eq. 4)  Biased (too large on average), using population mean

where the sums are over each x in the sample.
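To make the four formulae concrete, here is a minimal Python sketch of each. The function names eq1 to eq4, the sample data and the "known" population mean of 4.5 are all made up for illustration.

```python
def eq1(xs):
    """Eq. 1: sum of squares about the sample mean, divided by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def eq2(xs):
    """Eq. 2: sum of squares about the sample mean, divided by n-1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def eq3(xs, mu):
    """Eq. 3: sum of squares about a known population mean, divided by n."""
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def eq4(xs, mu):
    """Eq. 4: sum of squares about a known population mean, divided by n-1."""
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample; its mean is 5
mu = 4.5                         # hypothetical known population mean

print("Eq. 1:", eq1(data))      # 32/8
print("Eq. 2:", eq2(data))      # 32/7
print("Eq. 3:", eq3(data, mu))  # 34/8
print("Eq. 4:", eq4(data, mu))  # 34/7
```

For this sample the sum of squares is 32 about the sample mean and 34 about µ = 4.5. The sum about the sample mean can never be the larger of the two, so Eq. 1 never exceeds Eq. 3 and Eq. 2 never exceeds Eq. 4 on the same data.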

--

Steven