Topic: Parallel Computing Toolbox - Random numbers generation within tasks - a seed issue...
Replies: 8   Last Post: Apr 16, 2013 11:05 AM

Peter Perkins

Re: Parallel Computing Toolbox - Random numbers generation within
Posted: Apr 12, 2013 10:30 AM

>> However, in general this approach tends to use always the "low"
>> stream/substream (because usually the jobID and taskID are relatively
>> small numbers compared to the max number of streams / substreams).


Why do you care? The proper statistical properties are already built
into the algorithms. Just let that happen.

>> s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', jobID, 'seed', 'shuffle');

It's hard to imagine that with 2^63 streams and 2^51 substreams, you
really need to shuffle the seed. In any case:

>> help randstream.create
RandStream.create Create multiple independent random number streams.
[snip]
'NumStreams', 'StreamIndices', and 'Seed' can be used to ensure
that multiple streams created at different times are independent.
Streams of the same type and created using the same value for
'NumStreams' and 'Seed', but with different values of
'StreamIndices', are independent even if they were created in
separate calls to RandStream.create.

The converse of that, probably not stated clearly enough (I will make a
note to have that improved), is that if you use different seeds for the
parallel generators, then all bets are off as far as independence goes.
You are perhaps OK, but mrg32k3a was designed to achieve
independence using streams/substreams with the same seed.
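
A minimal sketch of that advice, assuming a fixed seed of 0 and hard-coded
placeholder IDs (in taskStartup.m the IDs would come from task.Parent.ID and
task.ID, as in the quoted code below). The variable names are illustrative
only:

```matlab
% Sketch only: same 'NumStreams' and same 'Seed' in every call; only the
% stream index and the substream index differ between jobs and tasks.
jobID  = 3;          % placeholder; assumed to come from task.Parent.ID
taskID = 7;          % placeholder; assumed to come from task.ID

numStreams = 2^63;   % must be identical in every call
fixedSeed  = 0;      % must be identical in every call (no 'shuffle')

s = RandStream.create('mrg32k3a', 'NumStreams', numStreams, ...
                      'StreamIndices', jobID, 'Seed', fixedSeed);
s.Substream = taskID;            % one substream per task
RandStream.setGlobalStream(s);   % rand, randn, randi now draw from s

x = rand(3, 1);
```

With this layout, numbers differ between runs only if the stream or
substream index differs between runs, for example because each run is a
new job with a new job ID.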


On 4/11/2013 10:06 AM, Gabriele wrote:
> Hi Peter,
> sorry for the late reply, but I was doing some testing.
> Thanks for the suggestions.
>
> I had some exchange of e-mails with the matlab support, and I received
> some good suggestions on this point.
>
> Those suggestions go along the same lines as yours, i.e. using streams and substreams.
>
> Now I have mainly two possibilities to select from.
>
> Option A:
> Prepare a file taskStartup.m embedding the following code
>
> %------------------
> taskID = task.ID;
> job = task.Parent;
> jobID = job.ID;
>
> s = RandStream.create('mrg32k3a', 'NumStreams', 2^63, 'Stream', ...
>     jobID, 'seed', 'shuffle');
> s.Substream = taskID;
> RandStream.setGlobalStream(s);
> %------------------
>
> Basically, the stream is selected on the basis of the jobID and the
> substream is selected on the basis of the taskID. In addition to this,
> the seed of the stream s is based on the clock time.
> My feeling is that this should work, because even if the jobID and the taskID
> are the same for two different calculations (suppose a previous
> (job,task) was properly deleted), the "shuffle" option should modify
> the seed, leading to different generated numbers.
>
> However, in general this approach tends to use always the "low"
> stream/substream (because usually the jobID and taskID are relatively
> small numbers compared to the max number of streams / substreams).
>
> An alternative I was thinking of would be as follows:
>
> Option B:
> %-----------------------------------
>
> %get task ID, (parent) job ID and job creation time
> taskID = task.ID;
> job = task.Parent;
> jobID = job.ID;
> job_creation_time=job.CreateTime;
>
> %convert the job creation time to seconds
> job_creation_time=job_creation_time([1:20,26:29]); %remove the time zone
> job_creation_time=round(datenum(job_creation_time, ...
>     'ddd mmm dd HH:MM:SS yyyy')*86400); %convert (units: seconds)
> %create shift indices using the job creation time:
> shift_index_stream=job_creation_time; %shift index for stream
> shift_index_substream=round(job_creation_time/1000); %shift index for substream
> %Now:
> %1) Create a large number of independent streams;
> %2) Select the stream using shift_index_stream and jobID
> %3) Generate also a random seed using "shuffle"
> %4) For this particular task use a substream identified by
> % shift_index_substream and taskID
> %
> NS=2^63; %number of multiple independent streams
> s = RandStream.create('mrg32k3a', 'NumStreams', NS, 'Stream', ...
>     shift_index_stream+jobID,'seed','shuffle');
> s.Substream = shift_index_substream+taskID;
> RandStream.setGlobalStream(s);
> %-----------------------------------
>
> The idea is to shift the indices for the streams and
> substreams. The shift is based on the jobID and the job creation
> time for the stream, and on the taskID and the job creation time for the
> substream. On top of this, "shuffle" is used to create a seed which is
> based on the task startup time (since taskStartup.m is called at the
> start of the task).
> This looks like "mixing" things a little bit more (since the indices of
> the streams/substreams used are more diverse). However, I'm not sure
> this gives correct statistical properties.
>
> So, the question is: considering that both seem to do the job, is it
> better to use option A or option B?
>
> Thanks,
> Gabriele
>
> PS: I have noticed something looking strange to me in the definition of
> the class RandStream, where the shuffle algorithm is implemented
> (function seed = shuffleSeed).
> I see the following :
> line #733: seed0 = mod(floor(now*8640000),2^31-1);
> line #735: seed = mod(floor(now*8640000),2^31-1);
>
> However, considering the seed can be any number smaller than 2^32, I
> would have expected:
> line #733: seed0 = mod(floor(now*8640000),2^32-1);
> line #735: seed = mod(floor(now*8640000),2^32-1);
>
> why is the shuffle seed limited to 2^31-1?
>
> Peter Perkins <Peter.Remove.Perkins.This@mathworks.com> wrote in message
> <kjem6b$e28$1@newscl01ah.mathworks.com>...

>> Gabriele, you're doing large-scale parallel simulations. You should be
>> using the right tools for that. Setting seeds based on current time or
>> whatever is like throwing darts at a dartboard. You need something
>> more controlled.
>>
>> MATLAB includes two random number generators, mrg32k3a and mldfg6331,
>> that are specifically designed for the kind of thing you're doing.
>> They both support multiple independent streams and substreams. (the
>> latter is more or less a lighterweight version of the former). I can't
>> really follow all of the "topology" that you describe, but by basing
>> the stream (or substream) index on the tasks, or workers, or runs, you
>> can ensure that you don't reuse the same random numbers.
>>
>> This is described at length in a couple of blog posts:
>>
>> http://blogs.mathworks.com/loren/2008/11/05/new-ways-with-random-numbers-part-i
>>
>> http://blogs.mathworks.com/loren/2008/11/13/new-ways-with-random-numbers-part-ii
>>
>>
>> I hope this is helpful.
>>
>>
>>
>>
>> On 3/25/2013 6:34 AM, Gabriele wrote:

>> > Hi All,
>> > I am having some problems in consistently generating random numbers
>> > within tasks.
>> > I suppose my problems come from the fact that it is not clear to me how
>> > the seed for the stream is handled by the tasks belonging to a job.
>> >
>> > So, to make a long story short, and to simplify the problem, I have a
>> > job, which comprises some tasks. Each task is generating random numbers.
>> > I would like, of course, that:
>> > 1) Generated (pseudo-)random numbers are different from task to task
>> > (actually also between tasks belonging to different jobs);
>> > 2) Generated (pseudo-)random numbers are different if I run the code twice.
>> >
>> > Unfortunately, I cannot manage to get both.
>> > If I just write "plain code" (simply calling, e.g., "rand" from each
>> > task) I achieve 1), but I do not achieve 2), i.e. outcomes from tasks
>> > are different, but if I run the code twice, I get exactly the same
>> > outcomes.
>> > If I try to force the seed (using, e.g., rng('shuffle')) I have the
>> > problem that, in some cases, different tasks (typically 2 tasks) seem
>> > to start at the same time (within the resolution of the "shuffle"
>> > algorithm, which seems to be 1/100 s judging from randstream.m). As a
>> > result, some outcomes are different, while others are the same.
>> >
>> > I tried putting a rng('shuffle') command in jobStartup.m and in
>> > taskStartup.m, but I couldn't achieve a robust result fulfilling 1) & 2)
>> > above. It is not clear to me how an rng(something) command in
>> > jobStartup.m affects the tasks.
>> >
>> > I have also tried passing the seed as a parameter to each task, by
>> > creating the seed for each task on the basis of the progressive task's
>> > number (say the task ID...). However, this is not very robust, because
>> > if you start your code twice, the number of tasks, combined with the
>> > time difference of the two runs can lead to partially identical
>> > results (this is because in one case you use, e.g.,
>> > seed=time+task_number, in the second case you use
>> > seed=time+delta_time+task_number, and for a given delta_time and two
>> > different task_number you could get the same seed).
>> >
>> > So, this is the problem.
>> > I post below code which reproduces the issue, at least on my hardware.
>> > In my case the local profile runs 4 workers (plus one client) because I
>> > have a quad core. Note that the issue does not happen always, so it
>> > might be necessary to run the code a few times to see a repetition in
>> > the generation.
>> > As you will see, in the task creation there are 4 options. Note that:
>> > - option 4 does not lead to repetition in my case, but results are the
>> > same at each run (looks like the starting seed for the tasks is always
>> > the same...i.e. 0). So this option is not usable.
>> > - option 3: in my case leads to some repetitions in the generated
>> > numbers. So this is not working.
>> > - option 2: can potentially lead to repetitions if the operations within
>> > the for-loop are faster than the "shuffle time accuracy". In my case I
>> > have not noticed any repetition, so this looks like the preferable
>> > option...but I am not 100% sure...a possibility would be to add a
>> > pause(0.01) command in the loop (just to be sure), but this is not
>> > fantastic...
>> > - option 1: can potentially lead to repetitions between different runs
>> > of the code
>> >
>> > a global alternative would be to create a seed beforehand for each task...
>> >
>> > ok, the code is below...
>> >
>> > %-------------
>> > %Main script
>> > %
>> >
>> > %% Identify a cluster:
>> > parallel.defaultClusterProfile('local');
>> > c = parcluster();
>> >
>> > %% Create a job
>> > j = createJob(c);
>> >
>> > %% Create tasks within a job
>> > %test random number generation
>> > Ntests=6*5;
>> > for jtest=1:Ntests, %create Ntests tasks
>> >
>> > %option 1: fix the seed from the main script on the basis of the seed of the client
>> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],getfield(rng,'Seed')+jtest});
>> > %option 2: generate the seed using "shuffle" at this moment
>> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],getfield(rng(rng('shuffle')),'Seed')});
>> > %option 3: let the function generate the seed internally, using shuffle
>> > %t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],[]});
>> > %option 4: let the task use the seed it is supposed to use
>> > t(jtest)=createTask(j, @f_myrand_with_seed, 1, {[3,1],-1});
>> >
>> > end;
>> >
>> > %% Submit the job to the queue
>> > submit(j);
>> >
>> > %% Wait for the job to complete:
>> > wait(j)
>> >
>> > %% Get results
>> > results = fetchOutputs(j);
>> >
>> > %% Delete the job and permanently remove it from the scheduler's storage location
>> > delete(j)
>> >
>> > %% Check the output
>> > %if two columns are equal, it means the corresponding tasks started from
>> > %the same random seed...which is something not wanted!
>> > fprintf('\nIf two columns are equal, this is bad...')
>> > final_data=[results{:}]
>> > %checking the first row is sufficient in this case
>> > if any(diff(sort(final_data(1,:)))==0),
>> > fprintf('\n...there is a generation problem!\n');
>> > else
>> > fprintf('\n...this generation seems to be ok!\n');
>> > end;
>> >
>> > %----------------------------
>> >
>> > %---------------------------
>> > %Additional function
>> >
>> > function out=f_myrand_with_seed(dim,sd)
>> >
>> > if nargin>1 && ~isempty(sd),
>> > if sd>0, %change the seed to the required value
>> > rng(sd);
>> > end; %note that, when sd<0 we do NOTHING
>> > else
>> > rng('shuffle'); %use the clock-based seed
>> > end;
>> > out=rand(dim);
>> > %out=rng;out=out.Seed; %use this line to return the seed of the present task
>> > %-----------------------------
>> >
>> > thanks for your comments...
>> >
>> > bye,
>> > gabriele




