The Math Forum

Ask Dr. Math - Questions and Answers from our Archives
Associated Topics || Dr. Math Home || Search Dr. Math

Probability of Duplicate Pairs

Date: 05/13/2003 at 07:23:54
From: Neil
Subject: Probability of generating duplicate addresses

I am trying to find the probability of getting at least 20 duplicate 
addresses when I draw a sample of 30,000 at random from UK households 
(estimate 21,000,000) where there is replacement every time a 
selection is made.

I have managed to work out the probablity of getting at least one 
duplicate by subtracting the probability of getting no duplicates from 
1. This is extremely high - coming in very close to 1. However, I 
cannot see how to calculate 2+, 3+ duplicates etc.

I have used the basic formula of 1-{[(N-n)/N]*[(N-n-1)/(N-1)]..etc} 
where N is the number in the universe, n is the number in the sample, 
and have repeated this n times to reflect the number of selections.

Date: 05/13/2003 at 11:07:21
From: Doctor Mitteldorf
Subject: Re: Probability of generating duplicate addresses

Neil -

A quick and practical answer to your question is this: There are
roughly (1/2) 30,000^2 pairs in your sample. If the pairs were
independent, the probability for each one to be a match would be
1/21,000,000. So the mean number of expected matches would be 

   (1/2) 30,000^2 / 21,000,000 = 21

The standard error in this number is roughly its square root, so you'd
expect 21 +/- 5 duplicates in your 30,000 draws. Because the numbers
are so high, this is quite an accurate estimate.

You might verify the estimate by a numerical experiment: write a
program to draw 30,000 random numbers between 1 and 21,000,000, then 
sort the numbers and count the duplicates. Repeat 100 times, and
you'll have a good idea whether the above estimate is valid.

Here are some thoughts on the formal (exact) solution to your problem, 
which I formulate thus: Suppose you draw n samples from a universe of 
N objects with replacement. What is the probability of exactly r 
duplicate pairs?

It is easier to approach this problem by counting permutations
separately, even though in the final analysis order doesn't matter. 
The number of possible samples of n is then just N^n. The subset in
which these are all different numbers N!/(N-n)!, so the probability
for r=0 duplicates is

   (N-n)! N^n

Suppose there are r duplicate pairs. Then the remaining (n-2r) samples 
can be selected (and ordered) from the remaining (N-r) universe 
objects in (N-r)!/((N-r)-(n-2r))! = (N-r)!/(N-n+r)! different ways.  

Where are the duplicate pairs located in the sample of n? This is the
number of ways of choosing (2r) from a list of n: C(n,2r).

There are C(N,r) ways of choosing which r objects will be the 

Finally, how many different ways may the r duplicates be arranged to 
make a list of length 2r? The answer is (2r)!/2^r. This is because 
there would be (2r)! permutations of the list if all the members were
distinct, but each swapping of a pair means we've overcounted by a
factor of 2.

Putting all this together, the probability of exactly n distinct pairs
(with no triples or quadruples) is 

      (N-r)! n! N! (2r)!                              
----------------------------------------   = 
(N-n+r)! (n-2r)! (2r)! (N-r)! r! 2^r N^n     

         N! n!
 (N-n+r)! (n-2r)! r! 2^r N^n

Practicing what I preach, I've tested this formula in 1,000,000
numerical trials for N=30 and n=15, r=0,1,2,3, and 4, and find that it
correctly predicts the number of pairs.

- Doctor Mitteldorf, The Math Forum 

Date: 05/16/2003 at 08:25:02
From: Neil
Subject: Thank you (Probability of generating duplicate addresses)

In the end I was able to work this out through trial and error - but 
it is lovely to have the formula to hand now and the explanation 
behind it. Given that I had found 20 duplications in the sample that I 
had drawn I wanted to check whether this was in line with what might 
be expected.  I now have a full answer. 

- Thank you again.  
Associated Topics:
College Probability
High School Permutations and Combinations
High School Probability

Search the Dr. Math Library:

Find items containing (put spaces between keywords):
Click only once for faster results:

[ Choose "whole words" when searching for a word like age.]

all keywords, in any order at least one, that exact phrase
parts of words whole words

Submit your own question to Dr. Math

[Privacy Policy] [Terms of Use]

Math Forum Home || Math Library || Quick Reference || Math Forum Search

Ask Dr. MathTM
© 1994- The Math Forum at NCTM. All rights reserved.