I have an enormous file (~70GB) with lines that look like this:

$ cat mybigfile.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

I want to break my enormous file into several less enormous files by grouping the blocks according to the last field of each block's header. So, running $ python magic.py mybigfile.txt should produce two new files, v07.txt and v08.txt.

$ cat v07.txt
5 7  
    1    1    0   -2    0    0    2
    0    4    0   -4    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -1
    0    0    0    0    0    1   -1
5 7  
    1    1    0   -2    0    0    5
    0    2    0   -2    0    0    2
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -4
    0    0    0    0    0    1   -4
5 7  
    1    1    0   -2    0    1   -1
    0    2    0   -2    0    0    4
    0    0    1   -1    0    0    0
    0    0    0    0    1    0   -2
    0    0    0    0    0    2   -4

$ cat v08.txt
5 8  
   -1   -1   -1   -1   -1    1    1    1
    0    0    2    0    0    0   -1   -1
    3    3    3   -1   -1   -1   -1   -1
   -1   -1   -1    0    2    0    0    0
   -1    1   -1    0    0    0    1    0

The headers of each block are all of the form 5 i with i ranging from i=6 to i=22.

Is this sort of thing doable? The only language I'm comfortable enough with to get started is Python, so I'd prefer a Python solution if possible.

Here is my solution:

from string import whitespace
import sys


class PolyBlock(object):

    def __init__(self, lines):
        self.lines = lines

    def nvertices(self):
        return self.lines[0].split()[-1]

    def outname(self):
        return 'v' + self.nvertices().zfill(2) + '.txt'

    def writelines(self):
        with open(self.outname(), 'a') as f:
            for line in self.lines:
                f.write(line)

    def __repr__(self):
        return ''.join(self.lines)


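def genblocks(path):
    with open(path, 'r') as f:
        block = [next(f)]
        for line in f:
            if line[0] in whitespace:
                # indented line: a matrix row of the current block
                block.append(line)
            else:
                # unindented line: a new header, so the previous block is done
                yield PolyBlock(block)
                block = [line]
        # the final block has no header after it, so yield it here
        yield PolyBlock(block)


def main():
    # input file is taken from the command line: python magic.py mybigfile.txt
    for block in genblocks(sys.argv[1]):
        block.writelines()
        sys.stdout.write(repr(block))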


if __name__ == '__main__':
    main()

My solution loops through each block and reopens and closes the output file for every block it writes. I suspect this can be made much more efficient, but I'm not sure how to improve my code.

Well of course it's doable. This might be a better question to ask over at SO when you've made an attempt at the code. – Hydranix 20 hours ago
    
I have a working solution that involves opening and closing an outfile for each block but I'm sure it's not the most efficient solution. – Brian Fitzpatrick 20 hours ago
    
That would be an excellent question for SO, there are quite a few python gurus over there. I myself am only familiar with C/C++. – Hydranix 20 hours ago
    
I guess I'm open to other solutions if they get the job done! – Brian Fitzpatrick 20 hours ago
Accepted answer (4 votes):

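If you are OK with awk, try this. A line with exactly two fields is a header and selects the output file; every line, header included, is then printed to the currently selected file. The sprintf zero-pads the second field to two digits, so headers from 5 6 to 5 22 map to v06.txt through v22.txt.

awk 'NF==2{filename=sprintf("v%02d.txt",$2)}{print > filename}' mybigfile.txt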
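
For readers who would rather stay in Python, the same single-pass idea can be sketched roughly as below. This is only an illustration, not the asker's code: the function name split_by_header and the handles dictionary are made up for the example. It assumes, as the question's own code does, that header lines start in the first column while matrix rows are indented, and it keeps each output file open for the whole run instead of reopening it per block.

```python
import sys


def split_by_header(path):
    """Copy each block of `path` into vNN.txt, NN being the header's last field."""
    handles = {}  # output name -> file object, opened once and reused
    out = None    # file that receives the current block
    try:
        with open(path) as src:
            for line in src:
                # Assumption: headers start in column 0, matrix rows are indented.
                if not line[0].isspace():
                    name = 'v' + line.split()[-1].zfill(2) + '.txt'
                    if name not in handles:
                        handles[name] = open(name, 'w')
                    out = handles[name]
                if out is not None:
                    out.write(line)
    finally:
        # Close every output file, even if reading failed part-way through.
        for f in handles.values():
            f.close()
    return sorted(handles)


if __name__ == '__main__' and len(sys.argv) > 1:
    split_by_header(sys.argv[1])
```

Saved as magic.py, this would be run as python magic.py mybigfile.txt. With headers limited to 5 6 through 5 22, at most 17 output files are open at once. Note that the files are opened in 'w' mode, so a second run overwrites earlier output rather than appending to it.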
This is unbelievably cool. Maybe I should start learning some other languages! – Brian Fitzpatrick 20 hours ago
