Probability of Duplicate PairsDate: 05/13/2003 at 07:23:54 From: Neil Subject: Probability of generating duplicate addresses I am trying to find the probability of getting at least 20 duplicate addresses when I draw a sample of 30,000 at random from UK households (estimate 21,000,000) where there is replacement every time a selection is made. I have managed to work out the probablity of getting at least one duplicate by subtracting the probability of getting no duplicates from 1. This is extremely high - coming in very close to 1. However, I cannot see how to calculate 2+, 3+ duplicates etc. I have used the basic formula of 1-{[(N-n)/N]*[(N-n-1)/(N-1)]..etc} where N is the number in the universe, n is the number in the sample, and have repeated this n times to reflect the number of selections. Date: 05/13/2003 at 11:07:21 From: Doctor Mitteldorf Subject: Re: Probability of generating duplicate addresses Neil - A quick and practical answer to your question is this: There are roughly (1/2) 30,000^2 pairs in your sample. If the pairs were independent, the probability for each one to be a match would be 1/21,000,000. So the mean number of expected matches would be (1/2) 30,000^2 / 21,000,000 = 21 The standard error in this number is roughly its square root, so you'd expect 21 +/- 5 duplicates in your 30,000 draws. Because the numbers are so high, this is quite an accurate estimate. You might verify the estimate by a numerical experiment: write a program to draw 30,000 random numbers between 1 and 21,000,000, then sort the numbers and count the duplicates. Repeat 100 times, and you'll have a good idea whether the above estimate is valid. Here are some thoughts on the formal (exact) solution to your problem, which I formulate thus: Suppose you draw n samples from a universe of N objects with replacement. What is the probability of exactly r duplicate pairs? It is easier to approach this problem by counting permutations separately, even though in the final analysis order doesn't matter. The number of possible samples of n is then just N^n. The subset in which these are all different numbers N!/(N-n)!, so the probability for r=0 duplicates is N! ---------- (N-n)! N^n Suppose there are r duplicate pairs. Then the remaining (n-2r) samples can be selected (and ordered) from the remaining (N-r) universe objects in (N-r)!/((N-r)-(n-2r))! = (N-r)!/(N-n+r)! different ways. Where are the duplicate pairs located in the sample of n? This is the number of ways of choosing (2r) from a list of n: C(n,2r). There are C(N,r) ways of choosing which r objects will be the duplicates. Finally, how many different ways may the r duplicates be arranged to make a list of length 2r? The answer is (2r)!/2^r. This is because there would be (2r)! permutations of the list if all the members were distinct, but each swapping of a pair means we've overcounted by a factor of 2. Putting all this together, the probability of exactly n distinct pairs (with no triples or quadruples) is (N-r)! n! N! (2r)! ---------------------------------------- = (N-n+r)! (n-2r)! (2r)! (N-r)! r! 2^r N^n N! n! ------------------------------ (N-n+r)! (n-2r)! r! 2^r N^n Practicing what I preach, I've tested this formula in 1,000,000 numerical trials for N=30 and n=15, r=0,1,2,3, and 4, and find that it correctly predicts the number of pairs. - Doctor Mitteldorf, The Math Forum http://mathforum.org/dr.math/ Date: 05/16/2003 at 08:25:02 From: Neil Subject: Thank you (Probability of generating duplicate addresses) In the end I was able to work this out through trial and error - but it is lovely to have the formula to hand now and the explanation behind it. Given that I had found 20 duplications in the sample that I had drawn I wanted to check whether this was in line with what might be expected. I now have a full answer. - Thank you again. Neil |
Search the Dr. Math Library: |
[Privacy Policy] [Terms of Use]
Ask Dr. Math^{TM}
© 1994-2015 The Math Forum
http://mathforum.org/dr.math/