Search All of the Math Forum:
Views expressed in these public forums are not endorsed by
NCTM or The Math Forum.



Re: shuffle without enough memory
Posted:
Nov 9, 2012 1:11 PM


Hi Roger,
I have difficulty to implement your suggestions. My problem is I have 1000 data in columns with real value Pi. The 11000 are column indices. I want to calculate the mean&median values for the triple products across all the columns, and also the biproducts of them, i.e. PiPjPk and PiPj, i j k are different from each other.
To make things faster and without violating the memory in Matlab, I calculate the possible 1e6 combinations of the column indices using the shuffle function. Then I calculate the products according to the shuffled index combinations.
I am wondering if there is an efficient and more accurate way to calculate the mean and median ? According to a previous suggestion from Roger, I may solve it as below cited text. but I have problems to implement it :
First, I generate 1e7 numbers but I found
floor(C(end)/2), (which is of course 83083500,)
this floor 83083500 is bigger than my 1e7 simulated numbers. So How can I proceed using your approach?
If I need to find the triple product value greater or equal to the product value at 83083500, I need to calculate the product values for 1000*999*998/6=166167000. This seems not saving much effort... Any suggestions are highly appreciated,
Best regards, "Roger Stafford" wrote in message <jd5cmk$cmu$1@newscl01ah.mathworks.com>... > "Skirt Zhang" wrote in message <jcvfv8$6pt$1@newscl01ah.mathworks.com>... > > I need to form 10^9 combinations by randomly select three from 1000 integers (1.....1000 refers to the firm identity). After finding the combinations, I refer to single firms' return and calculate the product of the three and get one value, I call it "Prod". So I should have 10^9 "Prod"s. Finally I need to calculate the median value for all these Prods. > > ....... > > Can anyone suggest me a good way to find the median value from the 10^7 prods? >          > It looks as though you are using a "Monte Carlo" method of approximating the median for your random process, but apparently you don't have enough memory to hold a sufficiently representative number of samples. If the median is the only thing you wish to find, there is a way of determining it precisely (as opposed to a Monte Carlo approximation) without providing arrays that are large enough to hold the products of all possible combinations. However it does require that you have a rough initial estimate of that median value. > > Suppose that of the 1000*999*998/6 = 166167000 total number of combinations of 1000 numbers taken three at a time, you have good reason to believe that the median of their products is a value somewhere between about 65 million and 75 million. Using only the 10^7 length array that you have stated would fit in your memory, it is possible to find the precise median of the products of all those combinations, provided this guess is correct. You can proceed along the following lines: > > L = 65e6; > S = 1e7; > C = zeros(S,1); > N = 1000; > for i1 = 1:N2 > for i2 = i1+1:N1 > p12 = i1*i2; > for i3 = i2+1:N > p123 = min(max(p12*i3L+1,1),S); > C(p123) = C(p123)+1; > end > end > end > C = cumsum(C); > > Taking into account its "L1" offset, the array C now contains all the necessary information for obtaining the desired median, provided of course that the above guess was correct. You are looking for the first occurrence in C of a cumulative count that is greater than or equal to the lower midpoint, floor(C(end)/2), (which is of course 83083500,) and with a correct guess above, that should occur somewhere between C(2) and C(end1). (Since 166167000 is even you are also looking for the first occurrence greater than or equal to 83083501 to get a valid median.) I will leave the messy details of doing all that to you. (If need be, these details can be expanded upon in subsequent postings.) > > As to justifying the above guess, one can reason as follows. First, N=1000 is sufficiently large that we can include cases where the numbers are repeated among the triplets without affecting the resulting median seriously. Moreover for the same reason we can assume a continuum of values between 0 and 1000 instead of discrete integers. This makes it a problem in calculus. Thus we then have a constant probability density of 1/N^3 in R^3 inside the cube 0<=x<=N, 0<=y<=N, and 0<=z<=N. We want the median of the product f(x,y,z) = x*y*z in that cube. The cumulative distribution function (CDF) is: > > F(c) = Pr(x*y*z<=c) = 1  Pr(x*y*z>c) = 1  Pr(u*v*w>t) > > where u = x/N, v = y/N, w = z/N, and t = c/N^3 and where the density of u, v, w over its unit cube is just 1. We can obtain an exact answer for the latter quantity (courtesy of the Symbolic Toolbox): > > F(c) = 1int(int(int(1,'w',t/(u*v),1),'v',t/u,1),'u',t,1) = > t*(1log(t)+1/2*log(t)^2) > > The value of t for which this expression equals 1/2 (for the median) is: > > t = .6897160962790907e1 > > so that c at the median would be c = t*N^3 = 68971609.62790907 . > > That is the justification for using the above estimate between 65 and 75 millions. Note that, depending on how accurate this estimate is, one can use shorter arrays than above if necessary. Of course if an estimate should prove to be sufficiently in error that the above nested forloops fail to entrap the median in C(2:end1), appropriate adjustments can be made with the value 'L' for other tries. > > Roger Stafford



