Date: Sep 1, 2011 4:30 AM
Author: Steven D'Aprano
Subject: Re: Which sample variance should I choose?

On Thu, 1 Sep 2011 04:34 am Paul wrote:

> If you know the population mean, then

> s2_1 = Σ(x - µ)^2 / n

> is unbiased.

>

> If you don't know the population mean, then

> s2_2 = Σ(x - m)^2 / (n-1)

> is unbiased, while

> s2_3 = Σ(x - m)^2 / n

> is biased but nevertheless more accurate than s2_2.

Thanks Paul, that's exactly the sort of thing I'm looking for.

Not that I don't believe you :) but if you also have a reference (especially

one that's online) that would be really helpful.
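
Short of a reference, a quick simulation is one way to sanity-check the three claims quoted above. What follows is only a minimal sketch under assumptions that are not in Paul's post: the data are drawn from a normal distribution with known µ = 0 and σ = 1, the sample size is n = 5, and the function name simulate and the seed are purely illustrative.

```python
import random

def simulate(n=5, trials=20000, mu=0.0, sigma=1.0, seed=12345):
    """Estimate the bias and mean squared error (MSE) of s2_1, s2_2, s2_3.

    Assumes normally distributed samples; mu and sigma are the (known)
    population parameters used to generate the data.
    """
    rng = random.Random(seed)
    true_var = sigma ** 2
    names = ("s2_1", "s2_2", "s2_3")
    total = dict.fromkeys(names, 0.0)    # running sum of the estimates
    sq_err = dict.fromkeys(names, 0.0)   # running sum of squared errors
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(xs) / n                           # sample mean
        ss_mu = sum((x - mu) ** 2 for x in xs)    # squares about population mean
        ss_m = sum((x - m) ** 2 for x in xs)      # squares about sample mean
        estimates = {
            "s2_1": ss_mu / n,        # population mean known, divide by n
            "s2_2": ss_m / (n - 1),   # sample mean, divide by n-1
            "s2_3": ss_m / n,         # sample mean, divide by n
        }
        for name, est in estimates.items():
            total[name] += est
            sq_err[name] += (est - true_var) ** 2
    # Map each estimator to (estimated bias, estimated MSE).
    return {name: (total[name] / trials - true_var, sq_err[name] / trials)
            for name in names}

results = simulate()
for name, (bias, mse) in results.items():
    print(f"{name}: bias = {bias:+.3f}, MSE = {mse:.3f}")
```

For normal data the estimated bias of s2_1 and s2_2 should hover near zero, the bias of s2_3 should come out near -σ^2/n, and the MSE of s2_3 should be smaller than that of s2_2. Other distributions may behave differently, so this is a check on the normal case only.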

Thanks to everyone who answered.

Steven.

> None of these distinctions matters if n is reasonably large.

>

> On Aug 31, 1:02 pm, Steven D'Aprano <steve

> +comp.lang.pyt...@pearwood.info> wrote:

>> Paige Miller wrote:

>> > On Aug 31, 1:06 am, Steven D'Aprano <steve

>> > +comp.lang.pyt...@pearwood.info> wrote:

>> [...]

>> >> Under what circumstances should I prefer each of these four estimators

>> >> of σ^2 and what are the pros and cons of each?

>>

>> > There is no answer until you tell us what you are planning to use the

>> > variance for.

>>

>> That's exactly what I'm trying to find out. Under which circumstances

>> should I prefer one method over the others?

>>

>> I don't actually have a *specific* usage in mind, other than answering

>> the question "what's the sample variance of this data?" But I would like

>> to understand why somebody might choose one version or another.

>>

>> E.g.

>>

>> the unbiased sample variance (divide by n-1) has the advantage that, on

>> average, it will equal the population variance (provided certain

>> assumptions hold, such as sampling with replacement);

>>

>> but the unbiased sample variance also has a larger spread, so although it

>> is the most accurate on average, there's a chance that it will be much

>> further off. The biased sample variance (divide by n) is less accurate

>> but more precise (the results are clustered more closely together, so the

>> chance of getting a result that is *way* off is much reduced);

>>

>> etc. Or at least, this is what I *think* is the case.

>>

>> I'm not even sure that it is mathematically valid to substitute µ into

>> the sample variance formulae instead of the sample mean. I can't see why

>> it wouldn't be, but I'm not sure.

>>

>> For reference, here are the suggested sample variance formulae again:

>>

>> s^2 = Σ(x - m)^2 / n      (Eq. 1) Biased, using sample mean

>> s^2 = Σ(x - m)^2 / (n-1)  (Eq. 2) Unbiased, using sample mean

>> s^2 = Σ(x - µ)^2 / n      (Eq. 3) Biased, using population mean

>> s^2 = Σ(x - µ)^2 / (n-1)  (Eq. 4) Unbiased, using population mean

>>

>> where the sums are over each x in the sample.
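
Whether each of the four formulae is biased can be checked from the expected value of its sum of squares. The following is a sketch of the standard argument, assuming the x_i are independent draws with common mean µ and variance σ^2; the index i and the expectation notation are added here for illustration and are not from the thread.

```latex
\begin{align}
\sum_{i=1}^{n} (x_i - \mu)^2
  &= \sum_{i=1}^{n} (x_i - m)^2 + n\,(m - \mu)^2 \\
\mathbb{E}\Big[\sum_{i=1}^{n} (x_i - \mu)^2\Big]
  &= n\,\sigma^2 \\
\mathbb{E}\big[n\,(m - \mu)^2\big]
  &= n \cdot \frac{\sigma^2}{n} = \sigma^2 \\
\mathbb{E}\Big[\sum_{i=1}^{n} (x_i - m)^2\Big]
  &= (n - 1)\,\sigma^2
\end{align}
```

Under these identities, dividing the sum about the sample mean by n-1 (Eq. 2) is unbiased and dividing it by n (Eq. 1) is biased low. With the population mean, it is the division by n (Eq. 3) that is unbiased, and the division by n-1 (Eq. 4) overestimates σ^2 on average by a factor of n/(n-1), which agrees with Paul's s2_1.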

>>

>> --

>> Steven

--

Steven