The Math Forum

Ask Dr. Math - Questions and Answers from our Archives
Associated Topics || Dr. Math Home || Search Dr. Math

Distinguishing between Two Random Sequences

Date: 05/09/2008 at 12:43:26
From: Paolo
Subject: Distinguish between (pseudo) random sequences

Is there any efficient way (say polynomial-time algorithm in L) to
distinguish between the two distributions D1 and D2 described below?

- D1: take L elements at random x_1, ... , x_L in [1, N].  Output s =
(x_1, ... , x_L)

- D2: take L elements at random x_1, ... , x_L in [1, N], then pick a
random v in [1, L] and replace x_v with a random element in [1, N/2].
Output s = (x_1, ... , x_L)

To give you an idea of the size of L and N for which the problem is
interesting, here is an example:
- The size of L is about 2^8
- N is in O(2^L)

It's not really confusing, it's just very hard for me to find a proof
that such an algorithm does not exist.  The most efficient way I found
to distinguish between D1 and D2 is this:

Suppose you have a black box that outputs consistently sequences of L
elements from one of the two distributions.  You have to find which
one with probability greater than 1/2 (that is just guessing).

The probability of having a sequence x_1, ... , x_L from D1 in which
for every i in [1, L] x_i > N/2 is 2^-L.  Hence, if the black box
outputs values from D1, I expect that analyzing 2^(L-1) outputs of the
black box I should find, on average, an output in which all x_i are
greater than N/2.  On the other hand, if the black box outputs values
from D2 I will never have all x_i>N/2.

This isn't an answer to the question, because I show there exists an
exponential (in L) algorithm that can distinguish between the two
sequences.  It does not prove that there is no polynomial time
algorithm that can do the same.

Date: 05/11/2008 at 10:32:01
From: Doctor Vogler
Subject: Re: Distinguish between (pseudo) random sequences

Hi Paolo,

Thanks for writing to Dr. Math.  Since neither distribution has any
reason to distinguish between different numbers in [1,N/2], nor any
reason to distinguish between different numbers in (N/2,N], this is
the same as asking how to distinguish between the two distributions:

 - D1': take L elements at random x_1, ... , x_L in {0, 1}.
   Output s = (x_1, ... , x_L)

 - D2': take L elements at random x_1, ... , x_L in {0, 1}.
   Then pick a random v in [1, L] and set x_v = 0.
   Output s = (x_1, ... , x_L)

Of course you realize that any sequence output by the second 
distribution might have also come from the first, and any sequence
output by the first distribution *except* the all-1 sequence might
have also come from the second.

It sounds like you are saying that you get to generate sequences from
both distributions, and you want to be *certain* that you can identify
them correctly.  In that case, you are right that you have to keep
generating sequences from both distributions until you get the all-1
sequence (all numbers > N/2), since that's the only sequence that
*can't* be returned by D2.  And you're right that you will generally
need roughly 2^L sequences before you'll get this one.

But usually the way one distinguishes between two different 
distributions is probabilistically.  For example, suppose D2 only had
a 50% chance of changing x_v.  Then any sequence could be output by
either distribution, even though they are different distributions.  So
the best you usually can do is make a guess and try to find the guess
with the best probability of being correct.  It's also usual to assume
that you have one of these two probabilities, so can only generate
sequences from one black box, and you have to guess whether it is D1
or D2.  But this is only a slightly different question than having
both boxes and trying to guess which is D1 and which is D2.

So if you generate k sequences from your two boxes, then the only
reasonable way to pick is to guess that D2 is the box giving more 0s
and fewer 1s (more numbers in [1,N/2]).  If L is large and k is small,
then this will give you only a slightly better than 50% chance of
guessing correctly.  But if you want something like a 75% chance, or
99% chance of guessing correctly, and you get to increase k in order
to increase your chances, then the question becomes how big to make k
in order to get the success probability you want.

Using Bayes' Theorem

  Wikipedia: Bayes' Theorem'_theorem 

we find that the probability of correctly guessing which box is which
by generating k sequences from each box and asking which has more
zeros, is computed as follows:

  A = first box is D1, second is D2
  B = first box gives more ones, second box gives more zeros

The probability of guessing correctly is

  P(A and B) + P(-A and -B),

which is the same as

  P(B | A) * P(A) + P(-B | -A) * P(-A).

Now, P(A) = 1/2, and P(-A) = 1/2.  Also, P(-B | -A) = P(B | A), so the
above formula equals P(B | A).  (Actually, that presumes that you
don't get exactly the same number of zeros.  But I can't tell you how
to guess in this case.  Of course, the probability of that happening
is very low when L and/or k is large, so it makes little difference.)
The hard part is computing P(B | A).

We have L*k independent random choices between 0 and 1 from the first
box (the D1 box), so the probability of getting exactly n ones is

  (Lk choose n) (1/2)^(Lk).

In the second box, we are guaranteed k zeros, and the other (L-1)k are
independent random choices, so the probability of getting exactly m
ones is

  ((L-1)k choose m) (1/2)^((L-1)k).

Therefore, P(B | A) is the sum

   sum  (Lk choose n) (1/2)^(Lk) * ((L-1)k choose m) (1/2)^((L-1)k)
  n > m

                     (L-1)k   Lk
    = (1/2)^((2L-1)k)  sum   sum  (Lk choose n) ((L-1)k choose m)
                       m=0  n=m+1

I computed this with the free math program GNU Pari, which you can
download from 

by defining the function

f(l,k)=sum( m=0,
           binomial((l-1)*k,m)*sum( n=m+1,
        / 2^((2*l-1)*k)

Using L = 20, I got a success probability of

  k = 5:   61%
  k = 10:  68%
  k = 20:  75%
  k = 30:  80%
  k = 40:  84%
  k = 50:  87%
  k = 60:  89%
  k = 80:  92%
  k = 100: 94%
  and so on....

So you certainly don't need 2^L (which would be k = 1,000,000 in this
case) to get a very good probability of success using this simpler
method of identification.

But if you need certainty, then you have to hold out until you get a
sequence that D2 can't produce.  That is, you can get very high
probability of success in polynomial time, but you can only get
certainty in exponential time.

If you have any questions about this or need more help, please write
back and show me what you have been able to do, and I will try to
offer further suggestions.

- Doctor Vogler, The Math Forum 

Date: 05/11/2008 at 19:35:25
From: Paolo
Subject: Thank you (Distinguish between (pseudo) random sequences)

Thank you very much, this is exactly what I was looking for.  You've
been very very helpful!

Best regards,

Associated Topics:
College Logic
College Probability

Search the Dr. Math Library:

Find items containing (put spaces between keywords):
Click only once for faster results:

[ Choose "whole words" when searching for a word like age.]

all keywords, in any order at least one, that exact phrase
parts of words whole words

Submit your own question to Dr. Math

[Privacy Policy] [Terms of Use]

Math Forum Home || Math Library || Quick Reference || Math Forum Search

Ask Dr. MathTM
© 1994- The Math Forum at NCTM. All rights reserved.