I have code that works, but it uses too much memory.
Essentially, the code takes one input file (let's call it the index; it is 2-column, tab-separated) and, for each line of a second input file (let's call it the data; it is 4-column, tab-separated), looks up the term in the data's first column and replaces it with the corresponding information from the index file.
An example of the index is:
amphibian anm|art|art|art|art
anaconda anm
aardvark anm
An example of the data is:
amphibian-n is green 10
anaconda-n is green 2
anaconda-n eats mice 1
aardvark-n eats plants 1
Thus, when the value in Col 1 of the data is replaced with the corresponding information from the index, the results are as follows:
anm-n is green
art-n is green
anm-n eats mice
anm-n eats plants
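The replacement described above can be sketched as follows. This is a minimal, illustrative reading of the example (dict contents are hand-written from the samples; collapsing duplicate classes and duplicate result rows is my assumption based on the shown output):

```python
# Index: base lemma -> list of classes (duplicates in "anm|art|art|art|art"
# collapsed to ["anm", "art"] here, matching the example output).
index = {
    "amphibian": ["anm", "art"],
    "anaconda": ["anm"],
    "aardvark": ["anm"],
}

# Data rows: (lemma-with-suffix, slot, filler, frequency).
data = [
    ("amphibian-n", "is", "green", 10),
    ("anaconda-n", "is", "green", 2),
    ("anaconda-n", "eats", "mice", 1),
    ("aardvark-n", "eats", "plants", 1),
]

replaced = set()
for lemma, slot, filler, freq in data:
    base = lemma[:-2]  # strip the "-n" suffix before looking up the index
    for cls in index.get(base, []):
        # Keep the "-n" suffix on the class, as in the example output.
        replaced.add((cls + "-n", slot, filler))

for row in sorted(replaced):
    print(" ".join(row))
```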
I divided the code into steps because the idea is to calculate, for each replaced item, the average of the values (Col 4 in the data) over Cols 2 and 3 of the data file. The code takes the total number of slot-fillers in the data file and sums their values, which is then used in Step 3.
The desired results are the following:
anm second hello 1.0
anm eats plants 1.0
anm first heador 0.333333333333
art first heador 0.666666666667
I open the same input file several times (three times, in Steps 1, 2 and 3) because I need to build several dictionaries that must be created in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes). However, the necessary addition of Step 2 consumes all memory before Step 3 even begins.
Is there a way to optimize how many times I open the same input file?
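One pass can be saved outright: Steps 1 and 2 each only need to see every data line once, so the `lemmas` set and the slot/filler totals can be built in a single read of the file. A minimal sketch in Python 3 style (file contents and names below are illustrative, not your actual inputs; I also key the totals by a `(slot, filler)` tuple rather than nested dicts):

```python
from collections import defaultdict
import io

def build_tables(data_file, mapping):
    # Merges the old Steps 1 and 2 into one pass over the data file:
    # collect lemmas known to the index AND total frequency per (slot, filler).
    lemmas = set()
    featFreqs = defaultdict(float)  # flat dict keyed by (slot, filler)
    for line in data_file:
        lemma, slot, filler, freq = line.split()
        if lemma in mapping:
            lemmas.add(lemma)
        featFreqs[(slot, filler)] += int(freq)
    return lemmas, featFreqs

# Toy usage mirroring the example above (in the real script this would be
# the opened 'input-data' file instead of a StringIO).
mapping = {"anaconda-n": "anm", "aardvark-n": "anm"}
data = io.StringIO("anaconda-n is green 2\n"
                   "anaconda-n eats mice 1\n"
                   "aardvark-n eats plants 1\n")
lemmas, featFreqs = build_tables(data, mapping)
print(sorted(lemmas), featFreqs[("is", "green")])
```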
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division
from collections import defaultdict
import datetime

print "starting:",
print datetime.datetime.now()

# Load the index file into a dict mapping "concept-n" -> class string.
mapping = dict()
with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = conceptClass

print "- step 1:",
print datetime.datetime.now()

# Step 1: collect the data-file lemmas that appear in the index.
lemmas = set()
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if lemma in mapping:  # dict.has_key() is deprecated; use `in`
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()

# Step 2: total frequency per (slot, filler) pair.
featFreqs = defaultdict(lambda: defaultdict(float))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)

print "- step 3:",
print datetime.datetime.now()

# Step 3: distribute each frequency over the senses and normalise
# by the totals computed in Step 2.
classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += \
                    (int(freq) / len(senses)) / featFreqs[slot][filler]

print "- step 4:",
print datetime.datetime.now()

# Step 4: write out the results.
with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
        for slot in classFreqs[sense]:
            for fill in classFreqs[sense][slot]:
                outstring = '\t'.join([sense, slot, fill,
                                       str(classFreqs[sense][slot][fill])])
                oOutFile.write(outstring.encode("utf8") + '\n')
Any suggestions on how to optimize this code to process large text files (e.g. >4GB)?
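On the memory side, the triple-nested `defaultdict` in Step 3 allocates one inner dict (plus a lambda closure) per distinct sense and per distinct (sense, slot) pair; a single flat dict keyed by tuples stores the same numbers with far fewer dict objects. A hedged sketch in Python 3 style (the helper function and toy inputs are hypothetical, but the arithmetic mirrors Step 3 above):

```python
from collections import defaultdict

classFreqs = defaultdict(float)  # keyed by (sense, slot, filler)

def add_line(lemmaTAR, slot, filler, freq, mapping, featFreqs):
    # Same arithmetic as Step 3, but one flat dict instead of three nested ones.
    senses = mapping[lemmaTAR].split('|')
    for sense in senses:
        classFreqs[(sense, slot, filler)] += \
            (int(freq) / len(senses)) / featFreqs[(slot, filler)]

# Toy usage: amphibian-n has two senses; total (is, green) frequency is 12.
mapping = {"amphibian-n": "anm|art"}
featFreqs = {("is", "green"): 12.0}
add_line("amphibian-n", "is", "green", 10, mapping, featFreqs)
print(classFreqs[("anm", "is", "green")])  # (10 / 2) / 12
```

Sorting the flat keys at output time (`sorted(classFreqs)`) still groups rows by sense, so Step 4 needs only a minor change.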
Why does anaconda become art in the example? The index maps it to anm. – Janne Karila Mar 2 '14 at 10:20
anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the Col. info in Cols 2 and 3 is repeated. – owwoow14 Mar 2 '14 at 12:26