Topic: Obtaining Random LIne from A file
David Bailey

Re: Obtaining Random LIne from A file
Posted: Feb 19, 2013 6:52 PM

On 19/02/2013 06:09, Ramiro wrote:
> Thank you so much for the reply. My files are 50MB each, I don't think ReadList would work for my purposes, it would be too slow. I am actually doing an MCMC simulation, doing (hopefully if I have time) millions of iterations and in each one I need to read a random line from one of many files, thus requiring this reading to happen as quickly as possible. Any suggestions? Each line is pretty much the same length.
>
> Thanks,
> Ramiro
>


OK - let's establish two points:

1) Are the records in the files of a fixed length?

2) When you say you want an 'arbitrary line', I am assuming that you
calculate a number N, and then want the N'th line of the file. If you
really don't care which line you choose, use Ramiro's method (above).

If your files are not guaranteed to have equal-length records, there is
obviously a problem, as I explained before, because you have to read all
N-1 preceding lines to establish where the N'th starts. One option,
therefore, might be to pre-process your files to make fixed-length
records by padding with blanks.
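
The padding step could be sketched roughly as follows. This is only an illustration, not a tested tool: padToFixedRecords is a made-up name, the text is assumed to be plain ASCII (one byte per character), and StringPadRight is only available in more recent versions of Mathematica. The output is written through a binary stream so that every record ends in exactly one "\n" byte, whatever the platform.

```mathematica
(* Hypothetical helper: copy a text file so that every record has the same byte length. *)
(* Assumes plain ASCII text, one byte per character. *)
padToFixedRecords[in_String, out_String] :=
 Module[{lines, width, stream},
  lines = ReadList[in, String];        (* each line of the input as a string *)
  width = Max[StringLength /@ lines];  (* the longest line sets the record width *)
  stream = OpenWrite[out, BinaryFormat -> True];
  (* pad with blanks, then end every record with exactly one "\n" byte *)
  Scan[BinaryWrite[stream, StringPadRight[#, width] <> "\n"] &, lines];
  Close[stream];
  width + 1                            (* record length in bytes, including "\n" *)
 ]
```

The value returned is the record length in bytes, which is the number any later random-access read has to work with.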

Once you have fixed-record-length files, you can open them with
BinaryFormat->True and use SetStreamPosition to set the stream to the
position in bytes where your record starts (for the N'th record, that is
(N-1) times the record length, counting the end-of-line bytes), and read
the relevant number of bytes. Unless you are using extended characters,
you can convert these to characters with FromCharacterCode.
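
A minimal sketch of that read, under the assumption that records are counted from 1 and that the record length includes the end-of-line bytes. The name readRecord and the small demo file are invented for illustration. StringTrim strips whitespace from both ends, so it removes the padding and the line ending, but it would also remove any leading blanks in a record.

```mathematica
(* Hypothetical helper: read record n (counting from 1) from an open binary stream. *)
(* recLen is the record length in bytes, including the end-of-line bytes. *)
readRecord[stream_, n_Integer, recLen_Integer] :=
 Module[{bytes},
  SetStreamPosition[stream, (n - 1) recLen];      (* byte offset of record n *)
  bytes = BinaryReadList[stream, "Byte", recLen]; (* exactly one record *)
  StringTrim[FromCharacterCode[bytes]]            (* drop padding and line ending *)
 ]

(* Demo file: three records of 6 bytes each (5 characters plus "\n") *)
file = FileNameJoin[{$TemporaryDirectory, "fixed_demo.txt"}];
outStream = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[outStream, "alpha\nbeta \ngamma\n"];
Close[outStream];

inStream = OpenRead[file, BinaryFormat -> True];
nRecords = Quotient[FileByteCount[file], 6];
randomLine = readRecord[inStream, RandomInteger[{1, nRecords}], 6];
Close[inStream];
```

For millions of iterations it is probably worth opening each file once and keeping the stream open, rather than opening and closing it around every read.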

This should be VERY fast, because the cost of each access does not
depend on the size of the file (once all the files have been
preprocessed).

If the records are variable length but contain some identification such
as a line number, another option would be to pull out a line as Ramiro
suggested, but then use a binary chop procedure to zero in on the line
of interest. That relies on the identification increasing steadily
through the file.
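
One way such a binary chop might look is sketched below. Everything here is an assumption made for illustration: the names readLineAt, lineKey and findRecord are invented, lines are assumed to end in "\n", to be non-empty, and to begin with an integer key that increases through the file. The chop works on byte offsets: it jumps to the middle of the current range, moves forward to the next line start, and compares that line's key with the target.

```mathematica
(* Read the line starting at byte offset pos; return {line, offset just past it} *)
readLineAt[stream_, pos_Integer] :=
 Module[{bytes = {}, b},
  SetStreamPosition[stream, pos];
  While[(b = BinaryRead[stream, "Byte"]) =!= EndOfFile && b != 10,
   AppendTo[bytes, b]];
  {FromCharacterCode[bytes], StreamPosition[stream]}
 ]

(* Assumed layout: the key is the first whitespace-separated field of the line *)
lineKey[line_String] := FromDigits[First[StringSplit[line]]]

findRecord[file_String, target_Integer] :=
 Module[{stream, size = FileByteCount[file], lo = 0, hi, mid, start,
   line, next, result = Missing["NotFound"]},
  hi = size;
  stream = OpenRead[file, BinaryFormat -> True];
  (* lo is always a line start; every line before lo has a key below target *)
  While[lo < hi,
   mid = Quotient[lo + hi, 2];
   (* first line start at or after mid: skip the rest of the line holding byte mid-1 *)
   start = If[mid == 0, 0, Last[readLineAt[stream, mid - 1]]];
   If[start >= size,
    hi = mid,
    {line, next} = readLineAt[stream, start];
    If[lineKey[line] >= target, hi = mid, lo = next]]];
  (* lo now marks the first line whose key is >= target, if there is one *)
  If[lo < size,
   line = First[readLineAt[stream, lo]];
   If[lineKey[line] == target, result = line]];
  Close[stream];
  result
 ]

(* Demo with made-up data *)
demo = FileNameJoin[{$TemporaryDirectory, "keyed_demo.txt"}];
w = OpenWrite[demo, BinaryFormat -> True];
BinaryWrite[w, "1 a\n2 bbbb\n3 cc\n5 dddddd\n8 e\n"];
Close[w];
findRecord[demo, 5]
```

Each probe reads at most two lines, and the number of probes grows with the logarithm of the file size, so this is slower than the fixed-length approach but needs no preprocessing.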

Hint: You may want to look at the processed file with a hex editor, to
make sure the record length is as you expect - remember Windows uses 2
characters per end of line (carriage return plus line feed)!
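
If a hex editor is not to hand, a rough check along the same lines can be made from within Mathematica. The helper name checkRecords is invented, and the sketch assumes the file holds at least one full record. It reports whether the file is a whole number of records long, whether the first record ends in a line feed (byte 10), and whether a carriage return (byte 13) appears in it.

```mathematica
(* Hypothetical helper: sanity-check a file that should hold fixed-length records. *)
checkRecords[file_String, recLen_Integer] :=
 Module[{first = BinaryReadList[file, "Byte", recLen]},  (* bytes of the first record *)
  <|"WholeRecords" -> (Mod[FileByteCount[file], recLen] == 0),
    "EndsWithLF" -> (Last[first] == 10),
    "HasCR" -> MemberQ[first, 13]|>
 ]
```

A call such as checkRecords["data_fixed.txt", 81], with your own file name and record length in place of these made-up ones, should give True, True, False for a file with single-byte line endings. If "HasCR" comes back True, the record length needs to allow for the extra byte.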

David Bailey
http://www.dbaileyconsultancy.co.uk




