Is there an efficient way to parse blocks of text in python?

Question

I have an enormous file (~70GB) with lines that look like this:

$ cat mybigfile.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

I want to break my enormous file into several less enormous files by grouping the blocks according to the last number in their headers. So, running $ python magic.py mybigfile.txt on the sample above should produce two new files, v07.txt and v08.txt:

$ cat v07.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

$ cat v08.txt
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0

The headers of each block are all of the form 5 i with i ranging from i=6 to i=22.
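
Since i can have two digits, the output name has to come from the last whitespace-separated field of the header rather than from its last character (the sample headers also end in trailing spaces). A minimal sketch of that naming rule, assuming the zero-padded vNN.txt scheme shown above; the standalone function name outname is only illustrative:

```python
def outname(header):
    """Map a header line such as '5 7' to an output name such as 'v07.txt'."""
    # split() discards the trailing spaces and the newline, so the last
    # field is the vertex count as a string: '6', '7', ..., '22'.
    last_field = header.split()[-1]
    # zfill(2) pads single digits with a zero and leaves '22' untouched.
    return "v" + last_field.zfill(2) + ".txt"


print(outname("5 7  \n"))  # v07.txt
print(outname("5 22\n"))   # v22.txt
```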

Is this sort of thing doable? The only language I'm comfortable enough with to get started is Python, so I'd prefer a Python solution if possible.

Here is my solution:

from string import whitespace
import sys


class PolyBlock(object):

    def __init__(self, lines):
        self.lines = lines

    def nvertices(self):
        return self.lines[0].split()[-1]

    def outname(self):
        return 'v' + self.nvertices().zfill(2) + '.txt'

    def writelines(self):
        with open(self.outname(), 'a') as f:
            for line in self.lines:
                f.write(line)

    def __repr__(self):
        return ''.join(self.lines)


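def genblocks(path):
    with open(path, 'r') as f:
        # The first line of the file is the header of the first block.
        block = [next(f)]
        for line in f:
            if line[0] in whitespace:
                # Matrix rows are indented, so they belong to the current block.
                block.append(line)
            else:
                # A new header: hand back the finished block and start the next.
                yield PolyBlock(block)
                block = [line]
        # The last block has no following header to trigger the yield above.
        yield PolyBlock(block)


def main():
    # The input path comes from the command line: python magic.py mybigfile.txt
    for block in genblocks(sys.argv[1]):
        block.writelines()
        sys.stdout.write(repr(block))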


if __name__ == '__main__':
    main()

My solution loops through each block and repeatedly opens and closes the output files. I suspect this could be much more efficient, but I'm not sure how to improve my code.
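
One way to avoid the repeated open/close is to keep each output file open after its first use and look it up in a dictionary. The following is a minimal sketch under a few assumptions: headers are the only lines that do not start with whitespace (the same test the code above uses), output names follow the vNN.txt scheme, and the function name split_blocks and its out_dir parameter are hypothetical. Unlike the append-mode version above, it opens outputs with 'w', so existing files are overwritten.

```python
import os
import tempfile
from contextlib import ExitStack


def split_blocks(src_path, out_dir="."):
    """Copy each block to v<NN>.txt, opening every output file only once."""
    with ExitStack() as stack, open(src_path, "r") as src:
        handles = {}  # output name -> open file object
        out = None    # file receiving the block currently being copied
        for line in src:
            # Assumption: header lines start in column 0, matrix rows are indented.
            if not line[0].isspace():
                name = "v" + line.split()[-1].zfill(2) + ".txt"
                if name not in handles:
                    path = os.path.join(out_dir, name)
                    # ExitStack closes every registered file when the with-block ends.
                    handles[name] = stack.enter_context(open(path, "w"))
                out = handles[name]
            if out is not None:
                out.write(line)
        return sorted(handles)


if __name__ == "__main__":
    # Small self-contained demo on a temporary copy of two sample blocks.
    sample = (
        "5 7  \n"
        "    1    1    0   -2    0    0    2\n"
        "5 8  \n"
        "   -1   -1   -1   -1   -1    1    1    1\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sample.txt")
        with open(src, "w") as f:
            f.write(sample)
        print(split_blocks(src, tmp))
```

With headers limited to i=6 through i=22 there are at most 17 output files, so holding them all open at once should stay well within typical per-process file limits. The input is still read one line at a time, so memory use does not grow with the size of the file.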

Well of course it's doable. This might be a better question to ask over at SO when you've made an attempt at the code. — Hydranix, 20 hours ago
I have a working solution that involves opening and closing an outfile for each block but I'm sure it's not the most efficient solution. — Brian Fitzpatrick, 20 hours ago
That would be an excellent question for SO, there are quite a few python gurus over there. I myself am only familiar with C/C++. — Hydranix, 20 hours ago
I guess I'm open to other solutions if they get the job done! — Brian Fitzpatrick, 20 hours ago

Kamaraj · Accepted Answer · 2017-02-10 04:09:57Z


If you are OK with using awk, then try this:

awk 'NF==2{filename=sprintf("v%02d.txt",$2)}{print > filename}' mybigfile.txt


This is unbelievably cool. Maybe I should start learning some other languages! – Brian Fitzpatrick 20 hours ago


