I have an ASCII data file with a format that's unfamiliar to me in terms of how I could best read the data into a list or array in Python. The ASCII data file is formatted like this:

line 0:          <month> <year>
lines 1 - 216:   12 integer values per line; each value occupies seven characters, the first of which is always a space

For example the first record in the file looks like this:

    1 1900
 -32768 -32768    790  -1457  -1367    -16   -575    116 -32768 -32768   1898 -32768
 -32768  -1289 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768 -32768
 -32768 -32768    -92 -32768 -32768 -32768    125 -32768 -32768 -32768 -32768 -32768
 -32768 -32768 -32768 -32768 -32768  -1656 -32768   -764 -32768 -32768 -32768 -32768
 <212 more lines like the above for this record, same spacing/separators/etc.>

I'll call the above a single record (all data for a single month), and there are about 1200 records in the file. The months increase sequentially from 1 to 12 before starting over with an increment of the year value. I want to read the records one at a time, something like this:

with open(data_file, 'r') as dataFile:
    # while file still has unread records
        # read month and year to use to create a datetime object
        # read the next 216 lines of 12 values into a list (or array) of 2592 values
        # process the record's list (or array) of data

Can someone suggest an efficient "Pythonic" way of doing the above looping over the records including how to best read the data into a list or array?

Thanks in advance for your help!


2 Answers

Accepted answer (score: 1)

itertools.groupby can be used here.

from datetime import date
from itertools import groupby

def keyfunc(line):
    # Return the date of the most recent header line; this is the group key.
    global key
    row = [int(v) for v in line.split()]
    if len(row) == 2:  # header line: <month> <year>
        month, year = row
        key = date(year, month, 1)
    return key

def read_file(fname):
    with open(fname, 'r') as f:
        for rec_date, lines in groupby(f, keyfunc):
            data = []
            for line in lines:
                row = [int(v) for v in line.split()]
                if len(row) == 2:  # skip the header line itself
                    continue
                data.extend(row)
            yield rec_date, data

for rec_date, data in read_file('data.txt'):
    print(rec_date, data[:5], '... (', len(data), ')')

The keyfunc is the clever bit: it returns the key for each line of the file. groupby produces one iterator for each run of contiguous lines that share a key. keyfunc uses a global to remember the date from the most recent 2-value (month, year) line, so each new 2-value line starts a new group keyed by that date. This global might be avoidable with a bit more thought. Within each group the data are aggregated into a single list, skipping the 2-value line, which groupby also includes in the group. The final result is a generator that yields a (date, data) 2-tuple for each record in your data file.
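
One way the global could be avoided is with a closure. The following is only a sketch: make_keyfunc and read_grouped are hypothetical names, not part of the code above, and read_grouped takes any iterable of lines (such as an open file) rather than a filename. It assumes Python 3 (for nonlocal) and that no data line ever holds exactly two values.

```python
from datetime import date
from itertools import groupby


def make_keyfunc():
    """Build a key function that remembers the latest header date (hypothetical helper)."""
    current = None

    def keyfunc(line):
        nonlocal current
        fields = line.split()
        if len(fields) == 2:  # header line: <month> <year>
            month, year = int(fields[0]), int(fields[1])
            current = date(year, month, 1)
        return current

    return keyfunc


def read_grouped(lines):
    """Yield (date, values) per record from an iterable of lines."""
    for rec_date, group in groupby(lines, make_keyfunc()):
        values = []
        for line in group:
            fields = line.split()
            if len(fields) != 2:  # skip the header line that opens each group
                values.extend(int(v) for v in fields)
        yield rec_date, values
```

Because a fresh key function is created on every call, the state lives in the closure and two files can be read independently without sharing a module-level variable.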

EDIT: Here's a simpler option that doesn't use itertools.groupby:

from datetime import date

def read_file2(fname):
    data = []
    with open(fname, 'r') as f:
        for line in f:
            row = [int(v) for v in line.split()]
            if len(row) == 2:  # header line: <month> <year>
                if data:
                    yield key, data  # emit the record just completed
                month, year = row
                key = date(year, month, 1)
                data = []
            else:
                data.extend(row)
        if data:
            yield key, data  # emit the final record


for rec_date, data in read_file2('data.txt'):
    print(rec_date, data[:5], '... (', len(data), ')')
This was a nice opportunity for me to try using itertools.groupby. It's probably also possible to aggregate the data with a simple iterator and implement the grouping logic yourself. –  Graeme Stuart Sep 11 '13 at 23:37

I'm now using your code as part of my program, it works like a charm. Thanks so much for your help and for helping me learn a bit more about how to do things well in Python! –  James Adams Sep 12 '13 at 19:56

You could try building your NumPy array with a generator function, something like:

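import numpy

def read_input(input_file):
    line_count = 0
    format_line = lambda x: [float(i) for i in x.split()]

    with open(input_file) as f:  # close the file when done
        for line in f:
            if line_count <= 216:
                yield format_line(line)
            else:
                break
            line_count += 1

# The first row holds 2 values and the rest hold 12, so the rows are ragged;
# dtype=object lets NumPy store them without raising an error.
data = numpy.array(list(read_input(input_file)), dtype=object)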

This will yield the (month, year) line followed by the first 216 data lines, as per your question.

This will only read the first set of data. Also, it doesn't distinguish between the date rows and the data rows. Finally, it's not very pythonic; I would use enumerate if I needed to track the line_count. See my answer for a more comprehensive approach. –  Graeme Stuart Sep 11 '13 at 23:34
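
For illustration, here is a rough sketch of how the line-counting idea might be extended to every record while keeping date rows and data rows apart. It relies on the layout described in the question (one header line followed by 216 data lines) and uses itertools.islice instead of a manual counter. The name read_fixed_records and the lines_per_record parameter are hypothetical, and plain lists are used so the sketch has no NumPy dependency; each list could be passed to numpy.array if an array is needed.

```python
from datetime import date
from itertools import islice

LINES_PER_RECORD = 216  # data lines per record, as described in the question


def read_fixed_records(lines, lines_per_record=LINES_PER_RECORD):
    """Yield (date, values) for each record in an iterable of lines."""
    lines = iter(lines)
    for header in lines:
        if not header.strip():
            continue  # tolerate blank lines between records (an assumption)
        month, year = (int(v) for v in header.split())
        values = []
        # Consume exactly the data lines that belong to this record.
        for line in islice(lines, lines_per_record):
            values.extend(int(v) for v in line.split())
        yield date(year, month, 1), values
```

An open file can be passed directly, for example inside `with open('data.txt') as f:` by looping over `read_fixed_records(f)`. Unlike the approaches that detect headers by counting values, this one depends on every record having exactly the stated number of lines.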
