Topic: Fast selection of lots of elements from a large list
Replies: 1   Last Post: Sep 29, 2012 3:10 AM

Sseziwa Mukasa

Posts: 108
Registered: 8/26/07
Re: Fast selection of lots of elements from a large list
Posted: Sep 29, 2012 3:10 AM

You can extract the rows a little more quickly if you sort them first and take advantage of the fact that they are unique. My timings are an order of magnitude faster because I'm using integers instead of strings for row IDs. If you could map your strings to integers, you may see a similar performance gain.

Anyway, here's my example code:

In[19]:= rowIds = Range[600000];
q = RandomSample[rowIds, 1000];

(* Baseline: one linear Position search per query ID *)
Timing[Map[Position[rowIds, #] &, q]][[1]]

(* Single pass over rowIds; relies on rowIds being in ascending order and the IDs in q being unique *)
Timing[Block[{result = {}, qSorted = Sort[q], index = 1},
  Do[
   If[rowIds[[i]] == qSorted[[index]],
    result = {result, rowIds[[i]]}; index++];
   If[index > Length[q], Break[]],
   {i, Length[rowIds]}];
  Flatten[result]]][[1]]

Out[21]= 5.71901
Out[22]= 3.44422

Again, note that this is not an apples-to-apples comparison: the second expression extracts the actual row, not just its position.
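
For string IDs, another option (not part of the timings above, and untested on the original data) is to build a lookup table once and replace each ID by its position, which avoids one scan of the list per query. The sketch below assumes made-up 12-character IDs generated with IntegerString; the names index, p, and found are hypothetical.

```mathematica
(* Hypothetical data: 600,000 unique 12-character string IDs *)
rowIds = Table[IntegerString[i, 36, 12], {i, 600000}];
q = RandomSample[rowIds, 1000];

(* Build the rule table ID -> position once; Dispatch hashes the rules *)
index = Dispatch[Thread[rowIds -> Range[Length[rowIds]]]];

(* Replace every queried ID by its position in rowIds *)
p = q /. index;

(* Check: looking the positions back up reproduces the query list *)
found = rowIds[[p]];
found === q
```

Here p is a flat list of integers rather than the nested lists that Position returns, so it can be passed straight to Part, for example data[[p]] for some other data set data with the same row order.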

Regards,
Sseziwa

On Sep 27, 2012, at 10:48 PM, Mark Coleman wrote:

> Greetings,
>
> I've been using Mathematica to perform cluster analysis on a data set with about 600,000 rows and 60 columns. I've had the FindClusters function return a unique row identifier (12 character string) rather than the clustered data because I want to "join" these results to another data set for further analysis. To accomplish this I've been using the Position function to identify the element numbers in each cluster.
>
> To give a specific example, my cluster analysis identifies twelve clusters in my original data set. The first of these clusters contains about 15,000 row identifiers. To extract the corresponding data from other data sets, I find the position of each identifier in my original data set using the simple code
>
> q=clusterResults[[1]]; (* row IDs for first cluster *)
> p=Map[Position[rowIDs,#]&,q];
>
> where "rowIDs" is the first column of the other data set, which contains the same string identifiers (rowIDs has about 600,000 sublists). I then Extract these elements ("rows") from the data set and continue my analysis.
>
> Unfortunately this is quite slow. Doing this on a sample of 1000 elements requires 340 seconds on my desktop computer. Some of my clusters have many tens of thousands of elements. I'm hoping someone can suggest a faster way of doing this.
>
> Thanks,
>
> Mark
>
>






