Associated Topics || Dr. Math Home || Search Dr. Math

### Probability of Duplicate Pairs

```Date: 05/13/2003 at 07:23:54
From: Neil
Subject: Probability of generating duplicate addresses

I am trying to find the probability of getting at least 20 duplicate
addresses when I draw a sample of 30,000 at random from UK households
(estimate 21,000,000) where there is replacement every time a

I have managed to work out the probablity of getting at least one
duplicate by subtracting the probability of getting no duplicates from
1. This is extremely high - coming in very close to 1. However, I
cannot see how to calculate 2+, 3+ duplicates etc.

I have used the basic formula of 1-{[(N-n)/N]*[(N-n-1)/(N-1)]..etc}
where N is the number in the universe, n is the number in the sample,
and have repeated this n times to reflect the number of selections.
```

```
Date: 05/13/2003 at 11:07:21
From: Doctor Mitteldorf
Subject: Re: Probability of generating duplicate addresses

Neil -

A quick and practical answer to your question is this: There are
roughly (1/2) 30,000^2 pairs in your sample. If the pairs were
independent, the probability for each one to be a match would be
1/21,000,000. So the mean number of expected matches would be

(1/2) 30,000^2 / 21,000,000 = 21

The standard error in this number is roughly its square root, so you'd
expect 21 +/- 5 duplicates in your 30,000 draws. Because the numbers
are so high, this is quite an accurate estimate.

You might verify the estimate by a numerical experiment: write a
program to draw 30,000 random numbers between 1 and 21,000,000, then
sort the numbers and count the duplicates. Repeat 100 times, and
you'll have a good idea whether the above estimate is valid.

Here are some thoughts on the formal (exact) solution to your problem,
which I formulate thus: Suppose you draw n samples from a universe of
N objects with replacement. What is the probability of exactly r
duplicate pairs?

It is easier to approach this problem by counting permutations
separately, even though in the final analysis order doesn't matter.
The number of possible samples of n is then just N^n. The subset in
which these are all different numbers N!/(N-n)!, so the probability
for r=0 duplicates is

N!
----------
(N-n)! N^n

Suppose there are r duplicate pairs. Then the remaining (n-2r) samples
can be selected (and ordered) from the remaining (N-r) universe
objects in (N-r)!/((N-r)-(n-2r))! = (N-r)!/(N-n+r)! different ways.

Where are the duplicate pairs located in the sample of n? This is the
number of ways of choosing (2r) from a list of n: C(n,2r).

There are C(N,r) ways of choosing which r objects will be the
duplicates.

Finally, how many different ways may the r duplicates be arranged to
make a list of length 2r? The answer is (2r)!/2^r. This is because
there would be (2r)! permutations of the list if all the members were
distinct, but each swapping of a pair means we've overcounted by a
factor of 2.

Putting all this together, the probability of exactly n distinct pairs
(with no triples or quadruples) is

(N-r)! n! N! (2r)!
----------------------------------------   =
(N-n+r)! (n-2r)! (2r)! (N-r)! r! 2^r N^n

N! n!
------------------------------
(N-n+r)! (n-2r)! r! 2^r N^n

Practicing what I preach, I've tested this formula in 1,000,000
numerical trials for N=30 and n=15, r=0,1,2,3, and 4, and find that it
correctly predicts the number of pairs.

- Doctor Mitteldorf, The Math Forum
http://mathforum.org/dr.math/
```

```
Date: 05/16/2003 at 08:25:02
From: Neil
Subject: Thank you (Probability of generating duplicate addresses)

In the end I was able to work this out through trial and error - but
it is lovely to have the formula to hand now and the explanation
behind it. Given that I had found 20 duplications in the sample that I
had drawn I wanted to check whether this was in line with what might
be expected.  I now have a full answer.

- Thank you again.
Neil
```
Associated Topics:
College Probability
High School Permutations and Combinations
High School Probability

Search the Dr. Math Library:

 Find items containing (put spaces between keywords):   Click only once for faster results: [ Choose "whole words" when searching for a word like age.] all keywords, in any order at least one, that exact phrase parts of words whole words

Submit your own question to Dr. Math
Math Forum Home || Math Library || Quick Reference || Math Forum Search