
Re: shuffle without enough memory
Posted:
Dec 26, 2011 8:20 AM


"Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>... > "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>... > > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods. > > ....... > > Can anyone suggest me a good way to find the median value from the 10^7 prods? >          > It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value. > > Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines: > > L = 65e6; > S = 1e7; > C = zeros(S,1); > N = 1000; > for i1 = 1:N2 > for i2 = i1+1:N1 > p12 = i1*i2; > for i3 = i2+1:N > p123 = min(max(p12*i3L+1,1),S); > C(p123) = C(p123)+1; > end > end > end > C = cumsum(C); > > Taking into account its "L1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.) > > As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is: > > F(c) = Pr(x*y*z<=c) = 1  Pr(x*y*z>c) = 1  Pr(u*v*w>t) > > where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox): > > F(c) = 1int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) = > t*(1log(t)+1/2*log(t)^2) > > The value of t for which this expression equals 1/2 (for the median) is: > > t = .6897160962790907e1 > > so that c at the median would be c = t*N^3 = 68971609.62790907 . > > That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested forloops fail to entrap the median in C(2:end1), appropriate adjustments can be made with the value 'L' for other tries. > > Roger Stafford
Hi Roger,
I am trying to implement your idea. However, for the 1 to 1000 integers I just use them for indexing/representing the firm's identity, i.e. 2 means firm 2's return. So I need to find the median for all the Prods' of 1000 returns, e.g. R1R2R3.....
Does this procedure mean you are calculating the Prod's of any 3 integers from 1:1000? p123 = min(max(p12*i3L+1,1),S); C(p123) = C(p123)+1;

