Math Forum » Discussions » Software » comp.soft-sys.matlab

Topic: Parallel Computing Toolbox - Random numbers generation within tasks - a seed issue...
Replies: 8   Last Post: Apr 16, 2013 11:05 AM

Peter Perkins

Re: Parallel Computing Toolbox - Random numbers generation within
Posted: Apr 15, 2013 9:22 PM

If you're doing that many parallel simulations, you should be using the
generator in the way in which it was designed. It has parallelism
designed into it in the form of 2^63 streams, each with 2^51 substreams,
to choose from. The mrg32k3a generator was not designed to be
parallelized via seeds.

Will it matter? Who can say. But the generator was tested to verify that
it gives statistical independence between streams and substreams, not
between different seeds.
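
For what it's worth, a minimal sketch of that approach might look like the following. It assumes one fixed seed (0 here, chosen arbitrarily) for every stream, and it hard-codes jobID and taskID as stand-ins for the values you would read from the task object in taskStartup.m. The variable names are illustrative only.

```matlab
% Sketch only: one fixed seed, independence via stream/substream indices.
% jobID and taskID are hard-coded stand-ins for task.Parent.ID and task.ID.
jobID      = 1;
taskID     = 2;
fixedSeed  = 0;       % assumption: any fixed value, identical for all streams
numStreams = 2^63;    % same value as used elsewhere in this thread

% Stream chosen by the job, substream chosen by the task
s = RandStream.create('mrg32k3a', 'NumStreams', numStreams, ...
                      'StreamIndices', jobID, 'Seed', fixedSeed);
s.Substream = taskID;
RandStream.setGlobalStream(s);
x = rand(3,1);        % drawn from stream jobID, substream taskID

% Another task of the same job: same stream, next substream
s2 = RandStream.create('mrg32k3a', 'NumStreams', numStreams, ...
                       'StreamIndices', jobID, 'Seed', fixedSeed);
s2.Substream = taskID + 1;
y = rand(s2, 3, 1);   % drawn from a different substream than x
```

With this layout the same (jobID, taskID) pair always reproduces the same numbers, so getting different numbers on a rerun means choosing a different stream index, for example from a run counter you store yourself (that counter is an assumption on my part, not a toolbox feature I know of).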


On 4/13/2013 10:05 AM, Gabriele wrote:
> Dear Peter,
> the reason why I need to shuffle the seed is very simple.
>
> Suppose that the code works this way:
> 1) First it generates one job with two tasks
> 2) The job is then submitted
> 3) After completion, the job is deleted
>
> In such a case the job has, for instance, jobID=1, and the taskIDs are 1
> and 2.
> The stream index is then 1, with substream indices 1 and 2 for the two
> tasks respectively.
>
> When the job is deleted, the jobID is removed. This means that, if I run
> the same code again, the new jobID can again be 1, with taskIDs 1 and 2
> (again).
>
> In such a situation (which is actually my real situation), if you don't
> use shuffle, the random number generator will again generate exactly
> the same random numbers for task 1 and task 2.
>
> As a result, without shuffle, whenever the pair (jobID,taskID) is the
> same, the same random numbers are generated.
> If you work with the local cluster profile, it is common to delete your
> jobs after completion and after gathering results. This means that it is
> not uncommon to start your new set of jobs from jobID=1. As a result,
> without using shuffle, it is very common to always generate the same
> random number series.
>
> In addition, for the same reason, it is very likely (I would say almost
> certain) that the stream indices and substream indices you are going to
> use will be the low ones (I cannot imagine a relatively standard system
> where the jobID/taskID have reached orders of magnitude of, e.g.,
> 2^30...). For this reason I was proposing option B, which tends to
> shift the stream and substream indices towards higher values.
>
> What do you think?
>
> Moreover, any comment on the 2^31-1 matter in the shuffle algorithm in
> the RandStream class?
>
> Thanks,
> Gabriele
>
> Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
> <kk95qg$e76$1@newscl01ah.mathworks.com>...

>> > However, in general this approach tends to use always the "low"
>> > stream/substream (because usually the jobID and taskID are relatively
>> > small numbers compared to the max number of streams / substreams).
>>
>> Why do you care? The proper statistical properties are already built
>> into the algorithms. Just let that happen.
>>
>> > s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', ...
>> >     jobID, 'seed', 'shuffle');
>>
>> It's hard to imagine that with 2^63 streams and 2^51 substreams, you
>> really need to shuffle the seed. In any case:
>>

>> >> help randstream.create
>> RandStream.create Create multiple independent random number streams.
>> [snip]
>> 'NumStreams', 'StreamIndices', and 'Seed' can be used to ensure
>> that multiple streams created at different times are independent.
>> Streams of the same type and created using the same value for
>> 'NumStreams' and 'Seed', but with different values of
>> 'StreamIndices', are independent even if they were created in
>> separate calls to RandStream.create.
>>
>> The converse of that, probably not stated clearly enough (I will make
>> a note to have that improved), is that if you use different seeds for
>> the parallel generators, then all bets are off as far as independence
>> goes. You are perhaps OK, but mrg32k3a was designed to achieve
>> independence using streams/substreams with the same seed.
>>
>>
>> On 4/11/2013 10:06 AM, Gabriele wrote:

>> > Hi Peter,
>> > sorry for the late reply, but I was doing some testing.
>> > Thanks for the suggestions.
>> >
>> > I had some exchange of e-mails with the matlab support, and I received
>> > some good suggestions on this point.
>> >
>> > Such suggestions go along your lines, i.e. using streams and substreams.
>> >
>> > Now I have mainly two possibilities to select from.
>> >
>> > Option A:
>> > Prepare a file taskStartup.m embedding the following code
>> >
>> > %------------------
>> > taskID = task.ID;
>> > job = task.Parent;
>> > jobID = job.ID;
>> >
>> > s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', ...
>> >     jobID, 'seed', 'shuffle');
>> > s.Substream = taskID;
>> > RandStream.setGlobalStream(s);
>> > %------------------
>> >
>> > Basically, the stream is selected on the basis of the jobID and the
>> > substream is selected on the basis of the taskID. In addition to this,
>> > the seed of the stream s is based on the clock time.
>> > My feeling is that this should work, because even if the jobID and the
>> > taskID are the same for two different calculations (suppose a previous
>> > (job,task) was properly deleted), the "shuffle" option should modify
>> > the seed, thus leading to different generated numbers.
>> >
>> > However, in general this approach tends to always use the "low"
>> > streams/substreams (because usually the jobID and taskID are relatively
>> > small numbers compared to the max number of streams / substreams).
>> >
>> > An alternative I was thinking of would be as follows:
>> >
>> > Option B:
>> > %-----------------------------------
>> >
>> > %get task ID, (parent) job ID and job creation time
>> > taskID = task.ID;
>> > job = task.Parent;
>> > jobID = job.ID;
>> > job_creation_time=job.CreateTime;
>> >
>> > %convert the job creation time to seconds
>> > job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
>> > job_creation_time=round(datenum(job_creation_time, ...
>> >     'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
>> > %create shift indices using the job creation time:
>> > shift_index_stream=job_creation_time; %shift index for stream
>> > shift_index_substream=round(job_creation_time/1000); %shift index for substream
>> > %Now:
>> > %1) Create a large number of independent streams;
>> > %2) Select the stream using shift_index_stream and jobID
>> > %3) Generate also a random seed using "shuffle"
>> > %4) For this particular task use a substream identified by
>> > % shift_index_substream and taskID
>> > %
>> > NS=2^63; %number of multiple independent streams
>> > s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', ...
>> >     shift_index_stream+jobID,'seed','shuffle');
>> > s.Substream = shift_index_substream+taskID;
>> > RandStream.setGlobalStream(s);
>> > %-----------------------------------
>> >
>> > The idea is to create a shifting of indices for the streams and
>> > substreams. Such shifting is based on the jobID and the job creation
>> > time for the stream, and on the taskID and the job creation time for
>> > the substream. On top of this, "shuffle" is used to create a seed which
>> > is based on the task startup time (since taskStartup.m is called at the
>> > start of the task).
>> > This looks like "mixing" things a little bit more (since the indices of
>> > the streams/substreams used are more diverse). However, I'm not sure
>> > this gives correct statistical properties.
>> >
>> > So, the question is: considering that both seem to do the job, is it
>> > better to use option A or option B?
>> >
>> > Thanks,
>> > Gabriele
>> >
>> > PS: I have noticed something looking strange to me in the definition of
>> > the class RandStream, where the shuffle algorithm is implemented
>> > (function seed = shuffleSeed).
>> > I see the following :
>> > line #733: seed0 = mod(floor(now*8640000),2^31-1);
>> > line #735: seed = mod(floor(now*8640000),2^31-1);
>> >
>> > However, considering the seed can be any number smaller than 2^32, I
>> > would have expected:
>> > line #733: seed0 = mod(floor(now*8640000),2^32-1);
>> > line #735: seed = mod(floor(now*8640000),2^32-1);
>> >
>> > why is the shuffle seed limited to 2^31-1?
>> >
>> > Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
>> > <kjem6b$e28$1@newscl01ah.mathworks.com>...
>> >> Gabriele, you're doing large-scale parallel simulations. You should be
>> >> using the right tools for that. Setting seeds based on current time or
>> >> whatever is like throwing darts at a dartboard. You need something
>> >> more controlled.
>> >>
>> >> MATLAB includes two random number generators, mrg32k3a and mldfg6331,
>> >> that are specifically designed for the kind of thing you're doing.
>> >> They both support multiple independent streams and substreams. (The
>> >> latter is more or less a lighter-weight version of the former.) I can't
>> >> really follow all of the "topology" that you describe, but by basing
>> >> the stream (or substream) index on the tasks, or workers, or runs, you
>> >> can ensure that you don't reuse the same random numbers.
>> >>
>> >> This is described at length in a couple of blog posts:
>> >>
>> >> http://blogs.mathworks.com/loren/2008/11/05/new-ways-with-random-numbers-part-i
>> >>
>> >> http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii
>> >>
>> >> I hope this is helpful.
>> >>
>> >>
>> >>
>> >>
>> >> On 3/25/2013 6:34 AM, Gabriele wrote:

>> >> > Hi All,
>> >> > I am having some problems in consistently generating random numbers
>> >> > within tasks.
>> >> > I suppose my problems come from the fact that it is not clear to me how
>> >> > the seed for the stream is handled by the tasks belonging to a job.
>> >> >
>> >> > So, to make a long story short, and to simplify the problem, I have a
>> >> > job, which comprises some tasks. Each task is generating random numbers.
>> >> > I would like, of course, that:
>> >> > 1) Generated (pseudo-)random numbers are different from task to task
>> >> > (actually also between tasks belonging to different jobs);
>> >> > 2) Generated (pseudo-)random numbers are different if I run the code
>> >> > twice.
>> >> >
>> >> > Unfortunately, I cannot manage to get both.
>> >> > If I just write "plain code" (simply calling, e.g., "rand" from each
>> >> > task) I achieve 1), but I do not achieve 2), i.e. outcomes from tasks
>> >> > are different, but if I run the code twice, I get exactly the same
>> >> > outcomes.
>> >> > If I try to force the seed (using, e.g., rng('shuffle')) I have the
>> >> > problem that, in some cases, different tasks (typically 2 tasks) seem
>> >> > to start at the same time (within the accuracy of the "shuffle"
>> >> > algorithm, which seems to be 1/100 s looking at randstream.m). As a
>> >> > result, some outcomes are different, while others are the same.
>> >> >
>> >> > I tried putting a rng('shuffle') command in jobStartup.m and in
>> >> > taskStartup.m, but I couldn't achieve a robust result fulfilling 1) & 2)
>> >> > above. It is not clear to me how an rng(something) command in
>> >> > jobStartup.m affects the tasks.
>> >> >
>> >> > I have also tried passing the seed as a parameter to each task, by
>> >> > creating the seed for each task on the basis of the progressive task
>> >> > number (say the task ID...). However, this is not very robust, because
>> >> > if you start your code twice, the number of tasks, combined with the
>> >> > time difference between the two runs, can lead to partially identical
>> >> > results (this is because in one case you use, e.g.,
>> >> > seed=time+task_number, in the second case you use
>> >> > seed=time+delta_time+task_number, and for a given delta_time and two
>> >> > different task_number you could get the same seed).
>> >> >
>> >> > So, this is the problem.
>> >> > I post below code which reproduces the issue, at least on my hardware.
>> >> > In my case the local profile runs 4 workers (plus one client) because I
>> >> > have a quad core. Note that the issue does not always happen, so it
>> >> > might be necessary to run the code a few times to see a repetition in
>> >> > the generation.
>> >> > As you will see, in the task creation there are 4 options. Note that:
>> >> > - option 4 does not lead to repetition in my case, but results are the
>> >> > same at each run (looks like the starting seed for the tasks is always
>> >> > the same...i.e. 0). So this option is not usable.
>> >> > - option 3: in my case leads to some repetitions in the generated
>> >> > numbers. So this is not working.
>> >> > - option 2: can potentially lead to repetitions if the operations within
>> >> > the for-loop are faster than the "shuffle time accuracy". In my case I
>> >> > have not noticed any repetition, so this looks like the preferable
>> >> > option...but I am not 100% sure...a possibility would be to add a
>> >> > pause(0.01) command in the loop (just to be sure), but this is not
>> >> > fantastic...
>> >> > - option 1: can potentially lead to repetitions between different runs
>> >> > of the code
>> >> >
>> >> > a global alternative would be to create a seed beforehand for each task...
>> >> >
>> >> > ok, the code is below...
>> >> >
>> >> > %-------------
>> >> > %Main script
>> >> > %
>> >> >
>> >> > %% Identify a cluster:
>> >> > parallel.defaultClusterProfile('local');
>> >> > c = parcluster();
>> >> >
>> >> > %% Create a job
>> >> > j = createJob(c);
>> >> >
>> >> > %% Create tasks within a job
>> >> > %test random number generation
>> >> > Ntests=6*5;
>> >> > for jtest=1:Ntests, %create Ntests tasks
>> >> >
>> >> > %option 1: fix the seed from the main script on the basis of the
>> >> > %seed of the client
>> >> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, ...
>> >> > %    {[3,1],getfield(rng,'Seed')+jtest});
>> >> > %option 2: generate the seed using "shuffle" at this moment
>> >> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, ...
>> >> > %    {[3,1],getfield(rng(rng('shuffle')),'Seed')});
>> >> > %option 3: let the function generate the seed internally, using shuffle
>> >> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],[]});
>> >> > %option 4: let the task use the seed it is supposed to use
>> >> > t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],-1});
>> >> >
>> >> > end;
>> >> >
>> >> > %% Submit the job to the queue
>> >> > submit(j);
>> >> >
>> >> > %% Wait for the job to complete:
>> >> > wait(j)
>> >> >
>> >> > %% Get results
>> >> > results = fetchOutputs(j);
>> >> >
>> >> > %% Delete the job and permanently remove the job from the scheduler's
>> >> > % storage location
>> >> > delete(j)
>> >> >
>> >> > %% Check the output
>> >> > %if two columns are equal, it means the corresponding tasks started from
>> >> > %the same random seed...which is something not wanted!
>> >> > fprintf('\nIf two columns are equal, this is bad...')
>> >> > final_data=[results{:}]
>> >> > %checking the first line is sufficient in this case
>> >> > if any(diff(sort(final_data(1,:)))==0),
>> >> > fprintf('\n...there is a generation problem!\n');
>> >> > else
>> >> > fprintf('\n...this generation seems to be ok!\n');
>> >> > end;
>> >> >
>> >> > %----------------------------
>> >> >
>> >> > %---------------------------
>> >> > %Additional function
>> >> >
>> >> > function out=f_myrand_with_seed(dim,sd)
>> >> >
>> >> > if nargin>1 && ~isempty(sd),
>> >> > if sd>0, %change the seed to the required value
>> >> > rng(sd);
>> >> > end; %note that, when sd<0 we do NOTHING
>> >> > else
>> >> > rng('shuffle'); %use the clock-based seed
>> >> > end;
>> >> > out=rand(dim);
>> >> > %out=rng;out=out.Seed; %use this line to have the seed from the present task
>> >> > %-----------------------------
>> >> >
>> >> > thanks for your comments...
>> >> >
>> >> > bye,
>> >> > gabriele
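
On the 1/100 s accuracy of shuffle mentioned in the original post: here is a small sketch using the seed formula quoted above from RandStream. The three time instants are arbitrary examples I picked, and shuffleSeedAt is just an illustrative helper name, not a MATLAB function.

```matlab
% Seed formula as quoted in this thread from RandStream's shuffleSeed:
%   seed = mod(floor(now*8640000), 2^31-1)
% 8640000 = 86400 s/day * 100, so the value advances once per 1/100 s.
shuffleSeedAt = @(t) mod(floor(t*8640000), 2^31-1);

% Arbitrary example instants, given as serial date numbers
t1 = datenum(2013, 4, 15, 21, 22, 0.002);
t2 = datenum(2013, 4, 15, 21, 22, 0.006);   % 4 ms after t1, same 1/100 s tick
t3 = datenum(2013, 4, 15, 21, 22, 0.012);   % 10 ms after t1, next tick

sameTick = shuffleSeedAt(t1) == shuffleSeedAt(t2);   % seeds collide
nextTick = shuffleSeedAt(t1) ~= shuffleSeedAt(t3);   % seeds differ
```

Two tasks that call shuffle within the same hundredth of a second would get the same seed under this formula, which matches the occasional repeated columns described in the original post.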




