"Rich Ulrich" wrote in message news:email@example.com...
On Mon, 20 May 2013 21:43:18 -0700 (PDT), Fern <firstname.lastname@example.org> wrote:
>Hi,
>
>I have a question on trying to reverse engineer the probability density
>function from which a set of numbers were generated. My setup is the
>following:
>
>1) I have two probability density functions, both of whose domain is
>bounded in [0,1]:
>a) Beta (4,2) distribution
>b) Uniform (0.358060,0.975273) distribution
>
>2) Note that the parameters of the Uniform distribution have been carefully
>selected so that it has the same mean and variance as the Beta
>distribution.
>
>3) From each distribution we generate 50 numbers
>
>4) We then sum these random numbers separately (for the beta and uniform)
>and the values are placed as elements in two vectors (RandBeta and
>RandUnif).
>
>5) We repeat steps 3-4 until the vectors RandBeta and RandUnif have 20,000
>elements each.
>
>In light of the Central Limit Theorem (which would hold for summing
>variates drawn from the two distributions above) my question is whether it
>is possible to examine the vectors RandBeta and RandUnif (without knowing
>which is which) and determine which was generated from the Beta pdf and
>which from the Uniform pdf?
>
>Thanks!
Selecting between two choices is not much reverse engineering.
The question is whether a sample of 20,000 is large enough to detect the differences in distributions based on sampling the averages of 50 uniforms vs. 50 beta(4,2). Since the means and variances were matched by construction, testing would depend on moments of higher order than the first and second.
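As a rough check on how large those higher-moment differences are, here is a minimal sketch using the standard closed-form skewness and excess kurtosis of a Beta(a, b) variate and the usual scaling for sums of i.i.d. terms. The function name and constants are illustrative, and the standard errors sqrt(6/N) and sqrt(24/N) are only large-sample approximations for near-normal data.

```python
import math

def beta_skew_exkurt(a, b):
    """Skewness and excess kurtosis of a single Beta(a, b) variate."""
    skew = 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))
    exkurt = 6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2)) / (
        a * b * (a + b + 2) * (a + b + 3))
    return skew, exkurt

N_TERMS = 50     # variates per sum
N_SUMS = 20000   # length of RandBeta / RandUnif

beta_skew, beta_exkurt = beta_skew_exkurt(4, 2)
unif_skew, unif_exkurt = 0.0, -1.2   # holds for any uniform distribution

# For a sum (or average) of n i.i.d. terms, skewness scales by 1/sqrt(n)
# and excess kurtosis scales by 1/n.
sum_skew_beta = beta_skew / math.sqrt(N_TERMS)
sum_skew_unif = unif_skew / math.sqrt(N_TERMS)
sum_exkurt_beta = beta_exkurt / N_TERMS
sum_exkurt_unif = unif_exkurt / N_TERMS

# Approximate standard errors of the sample skewness and sample kurtosis
# computed from N_SUMS near-normal observations (an assumption).
se_skew = math.sqrt(6 / N_SUMS)
se_kurt = math.sqrt(24 / N_SUMS)

print(f"skewness of sums:  beta {sum_skew_beta:.4f}, uniform {sum_skew_unif:.4f}, SE ~ {se_skew:.4f}")
print(f"excess kurtosis:   beta {sum_exkurt_beta:.4f}, uniform {sum_exkurt_unif:.4f}, SE ~ {se_kurt:.4f}")
```

On these figures the skewness of the Beta sums is about -0.066 against an approximate standard error of 0.017, while the kurtosis difference between the two cases is smaller than its standard error, so skewness looks like the more promising of the two moments.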
It is not clear that moments would be useful. In this context the ranges of possible values for the two averages are (0,1) and (0.358060,0.975273), so as soon as a value of the average outside the range (0.358060,0.975273) occurs, you know that the original distribution must have been the Beta (4,2). Of course, the probability of such an outcome from the Beta (4,2) average distribution might be too small for this to have much chance of happening within 20000 samples, but it perhaps indicates the way to go, which to me seems to be to look at the tail behaviour.
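The range check can be sketched as follows. The helper name is hypothetical; it assumes the vector holds sums of 50 variates, so each element is divided by 50 before comparing against the uniform's range. The z-values use a normal approximation, which is only a rough guide that far into the tails.

```python
import math

N_TERMS = 50
LO, HI = 0.358060, 0.975273      # uniform range quoted in the question
MEAN_1, VAR_1 = 2 / 3, 8 / 252   # mean and variance of a single Beta(4,2) draw

def count_outside_uniform_range(sums, n_terms=N_TERMS, lo=LO, hi=HI):
    """Count sums whose implied average lies outside the uniform's range."""
    return sum(not (lo <= s / n_terms <= hi) for s in sums)

# Distance of each end of the range from the mean of the average,
# measured in standard deviations of the average of N_TERMS draws.
sd_avg = math.sqrt(VAR_1 / N_TERMS)
z_hi = (HI - MEAN_1) / sd_avg
z_lo = (MEAN_1 - LO) / sd_avg

print(f"sd of average = {sd_avg:.5f}, z_lo = {z_lo:.2f}, z_hi = {z_hi:.2f}")
```

Both ends of the range come out at roughly 12 standard deviations from the mean, which is consistent with the remark that an out-of-range average is unlikely to turn up in 20000 samples. Any nonzero count from `count_outside_uniform_range` would still settle the question at once.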
For the uniform case, there are certainly analytical expressions for the distribution function of the average. There may not be a corresponding analytical expression for the beta distribution, but there are possibilities of finding the distribution function numerically. If neither of these appeals, there are still possibilities of proceeding if the OP is prepared to generate samples known to be from one or other of the two sources. A suggestion would be to construct appropriate log-survivor plots for the two tails and to see how the sample versions of these compare either to the known distributions (if possible) or to repeated samplings from the two sources. The repeated-sampling approach would at least give an idea of how much separation of the cases there can be in a sample of 20000 values.
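A minimal sketch of the repeated-sampling idea, using only the Python standard library. All names here are illustrative. It standardizes the sums with the known common mean and variance, then reports the empirical log-survivor value at +z and the log of the empirical distribution function at -z for a few replicate samples from each source. The demo uses 2000 sums per vector rather than 20000 purely to keep the run time short, and it prints numbers at a single z instead of drawing the full plots.

```python
import math
import random

MEAN_1, VAR_1 = 2 / 3, 8 / 252   # mean and variance of one Beta(4,2) draw
LO, HI = 0.358060, 0.975273      # matching uniform range from the question

def simulate_sums(draw, n_sums, n_terms=50):
    """Return n_sums sums, each of n_terms values produced by draw()."""
    return [sum(draw() for _ in range(n_terms)) for _ in range(n_sums)]

def standardize(sums, n_terms=50):
    """Centre and scale sums using the known common mean and variance."""
    mu, sd = n_terms * MEAN_1, math.sqrt(n_terms * VAR_1)
    return [(s - mu) / sd for s in sums]

def log_tail_fractions(z_values, z):
    """Empirical log-survivor at +z and log-CDF at -z (-inf if a tail is empty)."""
    n = len(z_values)
    upper = sum(v > z for v in z_values)
    lower = sum(v < -z for v in z_values)

    def safe_log(count):
        return math.log(count / n) if count else float("-inf")

    return safe_log(upper), safe_log(lower)

rng = random.Random(12345)       # fixed seed so the demo is repeatable
sources = {
    "beta": lambda: rng.betavariate(4, 2),
    "unif": lambda: rng.uniform(LO, HI),
}
N_SUMS = 2000   # reduced from 20000 to keep the pure-Python demo quick

for name, draw in sources.items():
    for rep in range(3):
        z_values = standardize(simulate_sums(draw, N_SUMS))
        up, low = log_tail_fractions(z_values, 2.0)
        print(f"{name} rep {rep}: log S(+2) = {up:.3f}, log F(-2) = {low:.3f}")
```

Comparing the spread of these values across replicates within a source against the gap between sources gives a feel for how much separation is achievable. Evaluating `log_tail_fractions` over a grid of z values would give the points for the log-survivor plots themselves.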