Distinguishing between Two Random SequencesDate: 05/09/2008 at 12:43:26 From: Paolo Subject: Distinguish between (pseudo) random sequences Is there any efficient way (say polynomial-time algorithm in L) to distinguish between the two distributions D1 and D2 described below? - D1: take L elements at random x_1, ... , x_L in [1, N]. Output s = (x_1, ... , x_L) - D2: take L elements at random x_1, ... , x_L in [1, N], then pick a random v in [1, L] and replace x_v with a random element in [1, N/2]. Output s = (x_1, ... , x_L) To give you an idea of the size of L and N for which the problem is interesting, here is an example: - The size of L is about 2^8 - N is in O(2^L) It's not really confusing, it's just very hard for me to find a proof that such an algorithm does not exist. The most efficient way I found to distinguish between D1 and D2 is this: Suppose you have a black box that outputs consistently sequences of L elements from one of the two distributions. You have to find which one with probability greater than 1/2 (that is just guessing). The probability of having a sequence x_1, ... , x_L from D1 in which for every i in [1, L] x_i > N/2 is 2^-L. Hence, if the black box outputs values from D1, I expect that analyzing 2^(L-1) outputs of the black box I should find, on average, an output in which all x_i are greater than N/2. On the other hand, if the black box outputs values from D2 I will never have all x_i>N/2. This isn't an answer to the question, because I show there exists an exponential (in L) algorithm that can distinguish between the two sequences. It does not prove that there is no polynomial time algorithm that can do the same. Date: 05/11/2008 at 10:32:01 From: Doctor Vogler Subject: Re: Distinguish between (pseudo) random sequences Hi Paolo, Thanks for writing to Dr. Math. Since neither distribution has any reason to distinguish between different numbers in [1,N/2], nor any reason to distinguish between different numbers in (N/2,N], this is the same as asking how to distinguish between the two distributions: - D1': take L elements at random x_1, ... , x_L in {0, 1}. Output s = (x_1, ... , x_L) - D2': take L elements at random x_1, ... , x_L in {0, 1}. Then pick a random v in [1, L] and set x_v = 0. Output s = (x_1, ... , x_L) Of course you realize that any sequence output by the second distribution might have also come from the first, and any sequence output by the first distribution *except* the all-1 sequence might have also come from the second. It sounds like you are saying that you get to generate sequences from both distributions, and you want to be *certain* that you can identify them correctly. In that case, you are right that you have to keep generating sequences from both distributions until you get the all-1 sequence (all numbers > N/2), since that's the only sequence that *can't* be returned by D2. And you're right that you will generally need roughly 2^L sequences before you'll get this one. But usually the way one distinguishes between two different distributions is probabilistically. For example, suppose D2 only had a 50% chance of changing x_v. Then any sequence could be output by either distribution, even though they are different distributions. So the best you usually can do is make a guess and try to find the guess with the best probability of being correct. It's also usual to assume that you have one of these two probabilities, so can only generate sequences from one black box, and you have to guess whether it is D1 or D2. But this is only a slightly different question than having both boxes and trying to guess which is D1 and which is D2. So if you generate k sequences from your two boxes, then the only reasonable way to pick is to guess that D2 is the box giving more 0s and fewer 1s (more numbers in [1,N/2]). If L is large and k is small, then this will give you only a slightly better than 50% chance of guessing correctly. But if you want something like a 75% chance, or 99% chance of guessing correctly, and you get to increase k in order to increase your chances, then the question becomes how big to make k in order to get the success probability you want. Using Bayes' Theorem Wikipedia: Bayes' Theorem http://en.wikipedia.org/wiki/Bayes'_theorem we find that the probability of correctly guessing which box is which by generating k sequences from each box and asking which has more zeros, is computed as follows: A = first box is D1, second is D2 B = first box gives more ones, second box gives more zeros The probability of guessing correctly is P(A and B) + P(-A and -B), which is the same as P(B | A) * P(A) + P(-B | -A) * P(-A). Now, P(A) = 1/2, and P(-A) = 1/2. Also, P(-B | -A) = P(B | A), so the above formula equals P(B | A). (Actually, that presumes that you don't get exactly the same number of zeros. But I can't tell you how to guess in this case. Of course, the probability of that happening is very low when L and/or k is large, so it makes little difference.) The hard part is computing P(B | A). We have L*k independent random choices between 0 and 1 from the first box (the D1 box), so the probability of getting exactly n ones is (Lk choose n) (1/2)^(Lk). In the second box, we are guaranteed k zeros, and the other (L-1)k are independent random choices, so the probability of getting exactly m ones is ((L-1)k choose m) (1/2)^((L-1)k). Therefore, P(B | A) is the sum sum (Lk choose n) (1/2)^(Lk) * ((L-1)k choose m) (1/2)^((L-1)k) n > m (L-1)k Lk = (1/2)^((2L-1)k) sum sum (Lk choose n) ((L-1)k choose m) m=0 n=m+1 I computed this with the free math program GNU Pari, which you can download from http://pari.math.u-bordeaux.fr/ by defining the function f(l,k)=sum( m=0, (l-1)*k, binomial((l-1)*k,m)*sum( n=m+1, l*k, 1.0*binomial(l*k,n))) / 2^((2*l-1)*k) Using L = 20, I got a success probability of k = 5: 61% k = 10: 68% k = 20: 75% k = 30: 80% k = 40: 84% k = 50: 87% k = 60: 89% k = 80: 92% k = 100: 94% and so on.... So you certainly don't need 2^L (which would be k = 1,000,000 in this case) to get a very good probability of success using this simpler method of identification. But if you need certainty, then you have to hold out until you get a sequence that D2 can't produce. That is, you can get very high probability of success in polynomial time, but you can only get certainty in exponential time. If you have any questions about this or need more help, please write back and show me what you have been able to do, and I will try to offer further suggestions. - Doctor Vogler, The Math Forum http://mathforum.org/dr.math/ Date: 05/11/2008 at 19:35:25 From: Paolo Subject: Thank you (Distinguish between (pseudo) random sequences) Thank you very much, this is exactly what I was looking for. You've been very very helpful! Best regards, Paolo |
Search the Dr. Math Library: |
[Privacy Policy] [Terms of Use]
Ask Dr. Math^{TM}
© 1994- The Math Forum at NCTM. All rights reserved.
http://mathforum.org/dr.math/