I have code that works, but it uses too much memory.
Essentially, the code takes one input file (let's call it the index; it is 2-column, tab-separated) and, for each line of a second input file (let's call it the data; it is 4-column, tab-separated), looks up the term in the data's first column and replaces it with the corresponding information from the index file.
An example of the index is:
amphibian anm|art|art|art|art
anaconda anm
aardvark anm
An example of the data is:
amphibian-n is green 10
anaconda-n is green 2
anaconda-n eats mice 1
aardvark-n eats plants 1
Thus, when the value in Col 1 of the data is replaced with the corresponding information from the index, the results are as follows:
anm-n is green
art-n is green
anm-n eats mice
anm-n eats plants
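The replacement described above can be sketched as follows. This is a minimal, illustrative reading of the example (dict contents are hand-written from the samples; collapsing duplicate classes and duplicate result rows is my assumption based on the shown output):

```python
# Index: base lemma -> list of classes (duplicates in "anm|art|art|art|art"
# collapsed to ["anm", "art"] here, matching the example output).
index = {
    "amphibian": ["anm", "art"],
    "anaconda": ["anm"],
    "aardvark": ["anm"],
}

# Data rows: (lemma-with-suffix, slot, filler, frequency).
data = [
    ("amphibian-n", "is", "green", 10),
    ("anaconda-n", "is", "green", 2),
    ("anaconda-n", "eats", "mice", 1),
    ("aardvark-n", "eats", "plants", 1),
]

replaced = set()
for lemma, slot, filler, freq in data:
    base = lemma[:-2]  # strip the "-n" suffix before looking up the index
    for cls in index.get(base, []):
        # Keep the "-n" suffix on the class, as in the example output.
        replaced.add((cls + "-n", slot, filler))

for row in sorted(replaced):
    print(" ".join(row))
```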
I divided the code into steps because the idea is to calculate, for each replaced item, the average of the values (Col 4 in the data) over Cols 2 and 3 of the data file. The code takes the total number of slot-fillers in the data file and sums their values, which is then used in Step 3.
The desired results are the following:
anm second hello 1.0
anm eats plants 1.0
anm first heador 0.333333333333
art first heador 0.666666666667
I open the same input file several times (three times, in Steps 1, 2 and 3) because I need to build several dictionaries that must be created in a certain order. However, the bottleneck is definitely between Steps 2 and 3. If I remove the function in Step 2, I can process the entire file (13 GB of RAM in approx. 30 minutes). However, the necessary addition of Step 2 consumes all memory before Step 3 even begins.
Is there a way to optimize how many times I open the same input file?
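One pass can be saved outright: Steps 1 and 2 each only need to see every data line once, so the `lemmas` set and the slot/filler totals can be built in a single read of the file. A minimal sketch in Python 3 style (file contents and names below are illustrative, not your actual inputs; I also key the totals by a `(slot, filler)` tuple rather than nested dicts):

```python
from collections import defaultdict
import io

def build_tables(data_file, mapping):
    # Merges the old Steps 1 and 2 into one pass over the data file:
    # collect lemmas known to the index AND total frequency per (slot, filler).
    lemmas = set()
    featFreqs = defaultdict(float)  # flat dict keyed by (slot, filler)
    for line in data_file:
        lemma, slot, filler, freq = line.split()
        if lemma in mapping:
            lemmas.add(lemma)
        featFreqs[(slot, filler)] += int(freq)
    return lemmas, featFreqs

# Toy usage mirroring the example above (in the real script this would be
# the opened 'input-data' file instead of a StringIO).
mapping = {"anaconda-n": "anm", "aardvark-n": "anm"}
data = io.StringIO("anaconda-n is green 2\n"
                   "anaconda-n eats mice 1\n"
                   "aardvark-n eats plants 1\n")
lemmas, featFreqs = build_tables(data, mapping)
print(sorted(lemmas), featFreqs[("is", "green")])
```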
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import division
from collections import defaultdict
import datetime

print "starting:",
print datetime.datetime.now()

# Load the index file into a dict mapping "concept-n" -> class string.
mapping = dict()
with open('input-map', "rb") as oSenseFile:
    for line in oSenseFile:
        uLine = unicode(line, "utf8")
        concept, conceptClass = uLine.split()
        if len(concept) > 2:
            mapping[concept + '-n'] = conceptClass

print "- step 1:",
print datetime.datetime.now()

# Step 1: collect the data-file lemmas that appear in the index.
lemmas = set()
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemma = uLine.split()[0]
        if lemma in mapping:  # dict.has_key() is deprecated; use `in`
            lemmas.add(lemma)

print "- step 2:",
print datetime.datetime.now()

# Step 2: total frequency per (slot, filler) pair.
featFreqs = defaultdict(lambda: defaultdict(float))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        featFreqs[slot][filler] += int(freq)

print "- step 3:",
print datetime.datetime.now()

# Step 3: distribute each frequency over the senses and normalise
# by the totals computed in Step 2.
classFreqs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
with open('input-data', "rb") as oIndexFile:
    for line in oIndexFile:
        uLine = unicode(line, "latin1")
        lemmaTAR, slot, filler, freq = uLine.split()
        if lemmaTAR in lemmas:
            senses = mapping[lemmaTAR].split(u'|')
            for sense in senses:
                classFreqs[sense][slot][filler] += \
                    (int(freq) / len(senses)) / featFreqs[slot][filler]

print "- step 4:",
print datetime.datetime.now()

# Step 4: write out the results.
with open('output', 'wb') as oOutFile:
    for sense in sorted(classFreqs):
        for slot in classFreqs[sense]:
            for fill in classFreqs[sense][slot]:
                outstring = '\t'.join([sense, slot, fill,
                                       str(classFreqs[sense][slot][fill])])
                oOutFile.write(outstring.encode("utf8") + '\n')
Any suggestions on how to optimize this code to process large text files (e.g. >4GB)?
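On the memory side, the triple-nested `defaultdict` in Step 3 allocates one inner dict (plus a lambda closure) per distinct sense and per distinct (sense, slot) pair; a single flat dict keyed by tuples stores the same numbers with far fewer dict objects. A hedged sketch in Python 3 style (the helper function and toy inputs are hypothetical, but the arithmetic mirrors Step 3 above):

```python
from collections import defaultdict

classFreqs = defaultdict(float)  # keyed by (sense, slot, filler)

def add_line(lemmaTAR, slot, filler, freq, mapping, featFreqs):
    # Same arithmetic as Step 3, but one flat dict instead of three nested ones.
    senses = mapping[lemmaTAR].split('|')
    for sense in senses:
        classFreqs[(sense, slot, filler)] += \
            (int(freq) / len(senses)) / featFreqs[(slot, filler)]

# Toy usage: amphibian-n has two senses; total (is, green) frequency is 12.
mapping = {"amphibian-n": "anm|art"}
featFreqs = {("is", "green"): 12.0}
add_line("amphibian-n", "is", "green", 10, mapping, featFreqs)
print(classFreqs[("anm", "is", "green")])  # (10 / 2) / 12
```

Sorting the flat keys at output time (`sorted(classFreqs)`) still groups rows by sense, so Step 4 needs only a minor change.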
Why does anaconda become art in the example? The index maps it to anm. – Janne Karila Mar 2 '14 at 10:20
anaconda does not become art; you are referring to amphibian, which is mapped to art. The example demonstrates that for each possible mapping, the Col. info in Cols 2 and 3 is repeated. – owwoow14 Mar 2 '14 at 12:26