Date: Dec 12, 2012 4:13 PM
Author: onzyone@gmail.com
Subject: Re: Problems with DCT waitForState command

On Tuesday, September 16, 2008 3:18:21 AM UTC-4, Edric M Ellis wrote:
> "Eric Solano" <ericssolano@gmail.com> writes:
>
> > I have a script that uses the DCT and calls a job scheduler (MOAB) to schedule
> > several jobs on a cluster. The jobs seem to be scheduled properly and
> > executed by the workers. However, when I try to gather the results for output,
> > the execution seems to be stuck forever at the waitForState command.
>
> Hi Eric,
>
> When things get stuck in "waitForState" for much longer than they should, that
> generally means that execution on the cluster hasn't worked completely
> correctly. In particular, if the state of the job (as far as DCT is concerned)
> never progresses beyond "queued", that is generally an indication that MATLAB on
> the cluster either hasn't been launched successfully, or it cannot write to the
> files in your DataLocation. (By the way, I assume that all jobs fail in this
> way, but that "qstat" indicates that they've completed)
>
> Are you using the example integration scripts for PBS/Torque? Do you have the
> output files created? If so, they may shed some light on why things aren't
> completing correctly.
>
> (The usual problems with using the example integration scripts are either that
> those scripts aren't on the default MATLAB path of the workers, or
> ClusterMatlabRoot isn't set correctly, or the workers cannot access the
> DataLocation).
>
> Cheers,
>
> Edric.


Hello Edric,

I have run into this issue as well with the same setup (Moab/Torque). I have confirmed that the jobs ran successfully on the cluster, but the client hung (tested with 1000 distributed jobs). This may be down to the way MATLAB has implemented its "checkjob" code.
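
As a stopgap on the client side, the wait can at least be bounded so the session does not hang indefinitely. Below is a rough sketch only, not a fix for the underlying problem. It assumes `job` is a job object that has already been submitted through the scheduler, and that waitForState accepts a timeout in seconds and returns true/false as described in the DCT documentation. The 60-second slice and the one-hour limit are arbitrary values chosen for illustration.

```matlab
% Sketch only: 'job' is assumed to be an already-submitted DCT job object.
% sliceSec and limitSec are arbitrary illustrative values.
sliceSec = 60;     % how long each individual wait may block
limitSec = 3600;   % give up after this many seconds in total
elapsed  = 0;
done     = false;

while ~done && elapsed < limitSec
    % With a timeout argument, waitForState is expected to return false
    % when the state has not been reached, instead of blocking forever.
    done    = waitForState(job, 'finished', sliceSec);
    elapsed = elapsed + sliceSec;
    fprintf('After %d s the job state is "%s"\n', elapsed, get(job, 'State'));
end

if done
    % Collect the task outputs only once the job reports 'finished'.
    results = getAllOutputArguments(job);
else
    warning('Job is still "%s" after %d s; check DataLocation and the scheduler output files.', ...
            get(job, 'State'), limitSec);
end
```

If the state printed by the loop never moves past "queued" even though qstat shows the jobs as completed, that would match what Edric describes above: the workers ran, but the client never saw the state files updated in DataLocation.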

Let me know what you think.

Thanks,
Jason.