> On Aug 31, 1:06 am, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
> [...]
>> Under what circumstances should I prefer each of these four estimators of
>> σ^2 and what are the pros and cons of each?
>
> There is no answer until you tell us what you are planning to use the
> variance for.

That's exactly what I'm trying to find out. Under which circumstances should I prefer one method over the others?
I don't actually have a *specific* usage in mind, other than answering the question "what's the sample variance of this data?" But I would like to understand why somebody might choose one version or another.
For example:

- the unbiased sample variance (divide by n-1) has the advantage that, on average, it will equal the population variance (provided certain assumptions hold, such as sampling with replacement);

- but the unbiased sample variance also has a larger spread, so although it is the most accurate on average, any single estimate has a chance of being much further off. The biased sample variance (divide by n) is less accurate but more precise (the results are clustered more closely together, so the chance of getting a result that is *way* off is much reduced);

- etc.

Or at least, this is what I *think* is the case.
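
One way to test that intuition is a quick simulation. This is only a rough sketch under assumptions of my own choosing: normally distributed data, a small sample size, a fixed seed, and a made-up function name (simulate). It draws many samples, applies both divisors to each, and reports the average and the spread of the resulting estimates.

```python
import random
import statistics


def simulate(n=5, trials=20000, mu=0.0, sigma=2.0, seed=12345):
    """Draw many samples of size n and summarise both estimators.

    All parameter values are arbitrary choices for illustration.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    biased, unbiased = [], []
    for _ in range(trials):
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(data) / n                      # sample mean
        ss = sum((x - m) ** 2 for x in data)   # sum of squared deviations
        biased.append(ss / n)                  # divide by n
        unbiased.append(ss / (n - 1))          # divide by n-1
    # For each estimator: (average estimate, spread of the estimates)
    return {
        "divide by n": (statistics.mean(biased), statistics.pstdev(biased)),
        "divide by n-1": (statistics.mean(unbiased), statistics.pstdev(unbiased)),
    }


if __name__ == "__main__":
    print("true variance:", 2.0 ** 2)
    for name, (avg, spread) in simulate().items():
        print(f"{name:>14}: mean={avg:.3f}  spread={spread:.3f}")
```

Since each divide-by-n estimate is exactly (n-1)/n times the divide-by-n-1 estimate for the same sample, both its average and its spread shrink by that factor. Whether that trade is worth making presumably depends on the use.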
I'm not even sure that it is mathematically valid to substitute µ into the sample variance formulae instead of the sample mean. I can't see why it wouldn't be, but I'm not sure.
For reference, here are the suggested sample variance formulae again:

s^2 = Σ(x - m)^2 / n        (Eq. 1)  Biased, using sample mean
s^2 = Σ(x - m)^2 / (n-1)    (Eq. 2)  Unbiased, using sample mean
s^2 = Σ(x - µ)^2 / n        (Eq. 3)  Biased, using population mean
s^2 = Σ(x - µ)^2 / (n-1)    (Eq. 4)  Unbiased, using population mean
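
To make the four formulae concrete, here is a direct transcription into Python. The function names are hypothetical, chosen only to match the equation numbers, and the code assumes the data is a non-empty sequence of numbers (at least two values for the n-1 versions).

```python
def sum_sq(data, centre):
    """Sum of squared deviations of the data from a given centre."""
    return sum((x - centre) ** 2 for x in data)


def sample_mean(data):
    return sum(data) / len(data)


def var_eq1(data):
    """Eq. 1: deviations from the sample mean m, divided by n."""
    return sum_sq(data, sample_mean(data)) / len(data)


def var_eq2(data):
    """Eq. 2: deviations from the sample mean m, divided by n-1."""
    return sum_sq(data, sample_mean(data)) / (len(data) - 1)


def var_eq3(data, mu):
    """Eq. 3: deviations from a known population mean mu, divided by n."""
    return sum_sq(data, mu) / len(data)


def var_eq4(data, mu):
    """Eq. 4: deviations from a known population mean mu, divided by n-1."""
    return sum_sq(data, mu) / (len(data) - 1)


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data; sample mean is 5
    mu = 4                            # pretend the population mean is known
    print(var_eq1(data), var_eq2(data), var_eq3(data, mu), var_eq4(data, mu))
```

Feeding many random samples through these functions and averaging the results is one way to check the biased/unbiased labels empirically.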