I'm parsing a document and writing pairs like these to disk:

0 vs 1, true
0 vs 2, false
0 vs 3, true
1 vs 2, true
1 vs 3, false
... and so on.
Afterwards, I balance the true and false rows for each instance by removing random lines (true lines if there are more trues than falses, and vice versa), and I end up with a file like this one:

0 vs 1 true
0 vs 2 false
1 vs 2 true
1 vs 3 true
1 vs 4 false
1 vs 5 false
There are usually far more falses than trues, so in the previous example I could keep only 1 false for instance 0 and only 2 falses for instance 1.
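To make the balancing step concrete, here is a minimal sketch of the idea, not the actual code. It assumes each row is an `(instance, other, label)` tuple where `instance` is the first number of the pair, and that everything fits in memory; the name `balance_rows` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def balance_rows(rows, seed=None):
    """Downsample the majority label of each instance so trues == falses.

    rows: iterable of (instance, other, label) tuples (assumed layout).
    """
    rng = random.Random(seed)

    # Group the rows by instance, then by label.
    by_instance = defaultdict(lambda: {True: [], False: []})
    for instance, other, label in rows:
        by_instance[instance][label].append((instance, other, label))

    balanced = []
    for instance in sorted(by_instance):
        groups = by_instance[instance]
        # Keep as many rows of each label as the rarer label has.
        # An instance with no trues (or no falses) is dropped entirely.
        keep = min(len(groups[True]), len(groups[False]))
        kept = rng.sample(groups[True], keep) + rng.sample(groups[False], keep)
        # Restore the original "instance vs other" ordering.
        kept.sort(key=lambda row: row[1])
        balanced.extend(kept)
    return balanced


if __name__ == "__main__":
    rows = [
        (0, 1, True), (0, 2, False), (0, 3, False),
        (1, 2, True), (1, 3, True), (1, 4, False), (1, 5, False), (1, 6, False),
    ]
    for instance, other, label in balance_rows(rows, seed=0):
        print(f"{instance} vs {other} {str(label).lower()}")
```

Which false rows survive is random, but each instance ends up with the same number of trues and falses.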
I'm doing this in 2 steps: first parsing, then balancing.
Now, my issue is that the unbalanced file is too big: more than 1 GB, and most of its rows will be removed by the balancing step anyway.
My question is: can I balance the rows while parsing?
My guess is no, because I don't know which items will arrive next, and I can't delete any row until all rows for a specific instance have been seen.