On Apr 30, 9:37 pm, Rich Ulrich <rich.ulr...@comcast.net> wrote:
> On Sat, 28 Apr 2012 08:18:48 -0700 (PDT), "analys...@hotmail.com"
> <analys...@hotmail.com> wrote:
>
> >This problem occurs a lot in real life.
> >
> >You sample n people, and a proportion p of them are found to be
> >carrying a red flag (like some political party, prefer a brand of soap,
> >etc.). Textbooks say that the estimate of the proportion carrying the
> >red flag in the total population is p with a variance of n.p.(1-p).
> >This would indicate that p close to 0 or 100 pct can be estimated with
> >smaller samples than p around 50 pct with the same confidence.
> >
> >But suppose we have carried out these samplings repeatedly and past
> >results show that the proportion carrying the red flag always comes in
> >between 0 and say 15 pct. We can even estimate a histogram
> >distribution of p from past samples. If we now make a new sampling of
> >n items - and we wish to rely on the past sampling results, how would
> >the mean and variance estimates change?
> >
> >Thanks for any replies.
>
> I thought you would elicit some sort of Bayesian answer,
> but that hasn't happened.
>
> Bayesian computation uses a "prior distribution" and
> comes up with a combined, Bayesian estimate -- but that
> is not the same, exactly, as reporting Mean and SD.
> And I'm not a Bayesian advocate, nor am I up to speed
> on what they are doing, but my impression is that the
> results, in terms of narrowing or modifying the estimators,
> are ordinarily of the magnitude that you get by adding a
> total of 1 case, or very few cases, to the observed sample
> size.
>
> If you want to make a statement based on a long time series
> of observations, there are classical techniques that *might*
> be applicable -- what is appropriate would depend on
> whether you are tapping some dimension that is constant
> ("is thought to be constant") or one that might
> have a slow change, relative to the number of census points.
>
> For the simplest instance -- if there is no change expected
> or suggested by the data, you might decide to pool all the
> available data, and present the overall mean and SD, based
> on the total N. If that comes to a really large N, it will
> produce an SD that is too small, because it will not take into
> account the standard error of the bias of the estimations.
>
> If there is slow change, you might argue for a time-series
> projection. That would mainly use the most recent points,
> but it might afford a more precise estimate of the present
> mean than you get by using the latest data alone.
>
> --
> Rich Ulrich
Thanks, Rich and David. I thought some more about the problem and it seems to me that we have to specify what is being measured.
(1) The classical problem can be stated in terms of an urn with a fixed number of black and white balls. You sample n balls (the number of balls in the urn is >> n, so that sampling with or without replacement makes no practical difference) and m of them turn out to be black. m/n is the best estimate of the proportion of black balls in the urn.
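A minimal sketch of case (1) in Python. One correction to the textbook claim quoted above: n.p.(1-p) is the variance of the *count* m; the variance of the estimated *proportion* m/n is p(1-p)/n, which is what shrinks as n grows.

```python
import math

def proportion_estimate(m, n):
    """Classical point estimate and standard error for an urn proportion.

    p_hat = m/n; its variance is p(1-p)/n, so the standard error is
    sqrt(p_hat * (1 - p_hat) / n).  (n*p*(1-p) is the variance of the
    count m, not of the proportion.)
    """
    p_hat = m / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# 12 black balls in a sample of 100:
p_hat, se = proportion_estimate(12, 100)
```

This also shows the point in the original question: for fixed n, p(1-p)/n is largest at p = 0.5 and smallest near 0 or 1.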
(2) In the "Bayesian" version there are k urns, with the proportion of black balls in urn j being p(j), all of which are known. You control n, the number of balls sampled, but they all come from a single urn whose identity is not known to you. The problem here is: if m of them turned out to be black, what is the probability that they all came from urn j?
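Case (2) is a direct application of Bayes' rule: the posterior probability of urn j is proportional to prior(j) times the binomial likelihood of seeing m black in n draws at rate p(j). A sketch (the two-urn example at the bottom is illustrative, not from the thread):

```python
from math import comb

def urn_posterior(p_list, prior, m, n):
    """Posterior probability that the sample came from each urn.

    p_list -- known black-ball proportions p(j), one per urn
    prior  -- prior probabilities over the urns (sums to 1)
    m, n   -- observed black balls out of n drawn from ONE unknown urn
    """
    # Binomial likelihood of m black in n draws from urn j
    like = [comb(n, m) * p**m * (1 - p) ** (n - m) for p in p_list]
    joint = [lk * pr for lk, pr in zip(like, prior)]
    z = sum(joint)                      # normalizing constant
    return [j / z for j in joint]

# Two urns with equal priors: 10% black vs 50% black.
# Observing only 2 black in 20 draws points strongly at the 10% urn.
post = urn_posterior([0.10, 0.50], [0.5, 0.5], m=2, n=20)
```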
In real life, case (1) applies when you are measuring an objective reality outside your sampling - such as the proportion of women or of left-handed people in a population. In this case the observed variation arises purely from the finiteness of the sample. Successive samples should simply be cumulated to get the best estimate of the population proportion.
Case (2) applies when each sampling is actually a "campaign" of sorts - you send out n mailings that solicit some action, and the response rate is not something that's objectively out there independent of your measurement. But if all the "campaigns" are not too dissimilar from each other, then past response rates can be used as a guide to what to expect. In this case there are two sources of variation: first, which past campaign your current campaign is most similar to, and secondarily, the normal sampling variation from finite sampling.
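One standard way to combine the two sources of variation in the campaign setting is an empirical-Bayes sketch: fit a beta prior to the past campaign rates by the method of moments, then shrink the new campaign's raw rate m/n toward the historical mean. The numbers below are made-up illustrations, and this assumes past rates are roughly beta-distributed; it is one possible formalization, not something stated in the thread.

```python
import statistics

def beta_prior_from_history(rates):
    """Method-of-moments Beta(a, b) fit to past campaign response rates.

    Captures the first source of variation: how much the underlying
    rate moves from campaign to campaign.
    """
    mu = statistics.mean(rates)
    var = statistics.variance(rates)
    k = mu * (1 - mu) / var - 1   # requires var < mu*(1-mu)
    return mu * k, (1 - mu) * k

def shrunk_estimate(m, n, a, b):
    """Posterior mean for the new campaign's rate: blends the raw m/n
    (second source: finite-sample noise) with the prior mean a/(a+b)."""
    return (a + m) / (a + b + n)

# Hypothetical past response rates, all between 0 and 15 pct:
past = [0.02, 0.05, 0.08, 0.11, 0.14]
a, b = beta_prior_from_history(past)
# New campaign: 30 responses to 1000 mailings.
est = shrunk_estimate(m=30, n=1000, a=a, b=b)
```

The posterior mean lands between the raw rate 0.03 and the historical mean 0.08, with the balance set by how variable the past campaigns were relative to the sample size n.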