I have an enormous file (~70GB) with lines that look like this:
$ cat mybigfile.txt
5 7
1 1 0 -2 0 0 2
0 4 0 -4 0 0 4
0 0 1 -1 0 0 0
0 0 0 0 1 0 -1
0 0 0 0 0 1 -1
5 8
-1 -1 -1 -1 -1 1 1 1
0 0 2 0 0 0 -1 -1
3 3 3 -1 -1 -1 -1 -1
-1 -1 -1 0 2 0 0 0
-1 1 -1 0 0 0 1 0
5 7
1 1 0 -2 0 0 5
0 2 0 -2 0 0 2
0 0 1 -1 0 0 0
0 0 0 0 1 0 -4
0 0 0 0 0 1 -4
5 7
1 1 0 -2 0 1 -1
0 2 0 -2 0 0 4
0 0 1 -1 0 0 0
0 0 0 0 1 0 -2
0 0 0 0 0 2 -4
I want to break my enormous file into several less enormous files by grouping the blocks according to the last number in each block's header. So, running
$ python magic.py mybigfile.txt
should produce two new files, v07.txt and v08.txt:
$ cat v07.txt
5 7
1 1 0 -2 0 0 2
0 4 0 -4 0 0 4
0 0 1 -1 0 0 0
0 0 0 0 1 0 -1
0 0 0 0 0 1 -1
5 7
1 1 0 -2 0 0 5
0 2 0 -2 0 0 2
0 0 1 -1 0 0 0
0 0 0 0 1 0 -4
0 0 0 0 0 1 -4
5 7
1 1 0 -2 0 1 -1
0 2 0 -2 0 0 4
0 0 1 -1 0 0 0
0 0 0 0 1 0 -2
0 0 0 0 0 2 -4
$ cat v08.txt
5 8
-1 -1 -1 -1 -1 1 1 1
0 0 2 0 0 0 -1 -1
3 3 3 -1 -1 -1 -1 -1
-1 -1 -1 0 2 0 0 0
-1 1 -1 0 0 0 1 0
The headers of each block are all of the form 5 i, with i ranging from i=6 to i=22.
Is this sort of thing doable? Python is the only language I'm comfortable enough with to get started, so I'd prefer a Python solution if possible.
Here is my solution:
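from string import whitespace
import sys

class PolyBlock(object):
    """One block of the input: a header line followed by its matrix rows."""

    def __init__(self, lines):
        self.lines = lines

    def nvertices(self):
        # The last field of the header line, e.g. '7' for the header '5 7'.
        return self.lines[0].split()[-1]

    def outname(self):
        return 'v' + self.nvertices().zfill(2) + '.txt'

    def writelines(self):
        # Reopens the output file in append mode for every single block.
        with open(self.outname(), 'a') as f:
            for line in self.lines:
                f.write(line)

    def __repr__(self):
        return ''.join(self.lines)

def genblocks(path):
    with open(path, 'r') as f:
        block = [next(f)]
        for line in f:
            # Assumes rows in the real file start with whitespace;
            # any other line is treated as the header of a new block.
            if line[0] in whitespace:
                block.append(line)
            else:
                yield PolyBlock(block)
                block = [line]
        # Emit the final block, which the loop above never yields.
        yield PolyBlock(block)

def main():
    # Input path comes from the command line: python magic.py mybigfile.txt
    for block in genblocks(sys.argv[1]):
        block.writelines()
        sys.stdout.write(repr(block))

if __name__ == '__main__':
    main()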
My solution loops through each block and repeatedly opens and closes the output files. I suspect this can be made much more efficient, but I'm not sure how to improve my code.
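One direction that might avoid the repeated open/close is to keep a single handle open per output file and reuse it, since there can be at most 17 output files (i=6 to i=22). Below is a minimal sketch of that idea, not a tuned solution. The names split_blocks and out_dir are hypothetical. It assumes header lines are the only lines with exactly two whitespace-separated fields, which holds for the sample because every row has at least six entries. It also opens outputs in 'w' mode, so existing v??.txt files are overwritten rather than appended to.

```python
import os
from contextlib import ExitStack


def split_blocks(in_path, out_dir="."):
    """Stream in_path once, writing each block to v<NN>.txt in out_dir.

    Returns the sorted list of header keys that were seen.
    """
    handles = {}  # last field of a header -> already-open output file
    with ExitStack() as stack, open(in_path) as src:
        out = None  # file receiving the lines of the current block
        for line in src:
            fields = line.split()
            # Assumption: only header lines have exactly two fields.
            if len(fields) == 2:
                key = fields[-1]
                if key not in handles:
                    name = os.path.join(out_dir, "v" + key.zfill(2) + ".txt")
                    # ExitStack closes every registered file when the
                    # with-block exits, even if an error occurs.
                    handles[key] = stack.enter_context(open(name, "w"))
                out = handles[key]
            if out is not None:
                out.write(line)
    return sorted(handles)
```

Calling split_blocks("mybigfile.txt") would read the input line by line, so memory use stays small regardless of file size, and each output file is opened only once.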