Optimizing python code for big data sets

Question

I'm trying to optimize a Python script that works on big data sets. It takes in a file with a list of keywords and scores, plus a file loaded with data from the Twitter API, and does a keyword match against the tweet text. At the end of the program I want to produce an average for each term found in the text field of the JSON file, e.g.

sad 3

With sad being the keyword and 3 being the average score.

It's running way too slow, but I'm new to Python, coming from a PHP background, and I think I'm doing things the PHP way in Python.

How can I get this code to run faster?

import sys
import json
import re

def findRecord(key, records):
    for r in records:
        if r[0] == key:
            return r

def average_records(records):
    for r in records:
        if r[1] > 0:
            avg = r[1] / r[2]
            print r[0] + ' ' + str(avg)
        else:
            avg = r[3] / r[4]
            print r[0] + ' ' + str(avg)

def hw(sent_file, tweet_file):
    scores = {}

    sent_file = open(sent_file, 'r')

    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)

    recored_affin = []

    #print scores.items()

    data = []

    with open(tweet_file, 'r') as f:
        for line in f:
            data.append(json.loads(line))

        #print data[4]['text']

        for tweet in data:
            total = 0
            if 'text' in tweet:
                for k, v in scores.iteritems():

                    #print tweet['text']
                    num_of_aff = len(re.findall(k, tweet['text']))
                    if num_of_aff > 0:
                        #print "Number is: " + str(num_of_aff)
                        #print "Word is: " + k
                        #print "Tweet is: " + tweet['text']
                        total += (v * num_of_aff)
                        #print "Score is: " + str(total)

                        #while count < len(recorded_affin):

                        foundRow = findRecord(k, recored_affin)

                        if foundRow != None:
                            index = recored_affin.index(foundRow)
                            quick_rec = recored_affin[index]

                            if v > 0:
                                new_value = quick_rec[1] + v
                                new_count = quick_rec[2] + 1
                                old_neg_value = 0
                                old_neg_count = 0
                                recored_affin.append([k, new_value, new_count, old_neg_value, old_neg_count])
                                recored_affin.remove(foundRow)
                            elif v < 0:

                                old_pos_value = 0
                                old_pos_count = 0
                                new_value = quick_rec[3] + v
                                new_count = quick_rec[4] + 1
                                recored_affin.append([k, old_pos_value, old_pos_count, new_value, new_count])
                                recored_affin.remove(foundRow)

                        else:
                            if v > 0:
                                recored_affin.append([k,v,1,0,0])
                            elif v < 0:
                                recored_affin.append([k,0,0,v,1])

                            #print recored_affin

                        ##print foundRow


                ##print total
    average_records(recored_affin)



def lines(fp):
    print str(len(fp.readlines()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    hw(sys.argv[1], sys.argv[2])
    #lines(sent_file)
    #lines(tweet_file)

if __name__ == '__main__':
    main()

Please provide some example data. Without being able to run your program, it's hard to tell whether we've made it any faster.

Azd325 · Answer 1 · 2013-05-12 15:52:14Z

Your code contains a mix of tabs and spaces. This caused your code to display incorrectly before I edited it. The usual convention in Python is to indent with spaces only. You should be able to configure your editor to insert spaces instead of tabs when you push the tab key.

import sys
import json
import re

def findRecord(key, records):

Python convention is to name functions lowercase_with_underscores, so this would be find_record.

    for r in records:
        if r[0] == key:
            return r

Looping over records like this to find a match is going to be inefficient. Instead, use a dictionary and look the records up by key.
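As a rough sketch of the difference, assuming records shaped like the ones in the question: the sample data and the names find_record_linear and records_by_term are made up for illustration.

```python
# Hypothetical sample records: [term, pos_total, pos_count, neg_total, neg_count]
records = [
    ["happy", 6, 2, 0, 0],
    ["sad", 0, 0, -4, 2],
]

def find_record_linear(key, records):
    """Scan the list until the term matches: O(n) work per lookup."""
    for record in records:
        if record[0] == key:
            return record
    return None

# Build the index once; each later lookup is a hash lookup, O(1) on average.
records_by_term = {record[0]: record for record in records}

linear_hit = find_record_linear("sad", records)
dict_hit = records_by_term.get("sad")     # .get returns None if the term is absent
missing = records_by_term.get("angry")
```

With a few thousand terms and many tweets, the linear scan runs once per match, which is where the time is likely to go.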

def average_records(records):
    for r in records:

Rather than indexing r all over the place, I suggest using:

    for k, new_value, new_count, old_neg_value, old_neg_count in records:

Then you can access those names directly. It'll be easier to read and probably marginally faster.
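A sketch of what the unpacked loop might look like. The records are invented, format_averages is a hypothetical name, and the names in the for statement are my own choice. It returns strings instead of printing so it runs the same on Python 2 and 3, and it uses float() so the division does not truncate under Python 2, which the original code does not do.

```python
# Hypothetical records: (term, pos_total, pos_count, neg_total, neg_count)
records = [
    ("happy", 6, 2, 0, 0),
    ("sad", 0, 0, -4, 2),
]

def format_averages(records):
    """Return one 'term average' string per record."""
    lines = []
    # Unpack each record into named parts instead of indexing r[0]..r[4].
    for term, pos_total, pos_count, neg_total, neg_count in records:
        if pos_total > 0:
            avg = float(pos_total) / pos_count
        else:
            avg = float(neg_total) / neg_count
        lines.append("%s %s" % (term, avg))
    return lines

averages = format_averages(records)
```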

        if r[1] > 0:
            avg = r[1] / r[2]
            print r[0] + ' ' + str(avg)
        else:
            avg = r[3] / r[4]
            print r[0] + ' ' + str(avg)

In this case, you can do: print r[0], avg for the same result.

def hw(sent_file, tweet_file):

I have no idea what hw means

    scores = {}

    sent_file = open(sent_file, 'r')

    for line in sent_file:
        term, score = line.split("\t")
        scores[term] = int(score)

This scores bit is a nicely self contained section. I suggest making it a separate function.
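Something along these lines, as a sketch: load_scores is a hypothetical name, and it takes any iterable of lines so it can be tried without a real sentiment file. The in-memory sample is made up.

```python
import io

def load_scores(lines):
    """Build a {term: score} dict from an iterable of 'term<TAB>score' lines."""
    scores = {}
    for line in lines:
        term, score = line.rstrip("\n").split("\t")
        scores[term] = int(score)
    return scores

# An in-memory stand-in for the sentiment file, so the sketch runs on its own.
sample = io.StringIO(u"happy\t3\nsad\t-2\n")
scores = load_scores(sample)
```

In the real program you would pass it the open file, for example inside a with open(sent_file) as f: block, so the file is also closed when you are done.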

    recored_affin = []

    #print scores.items()

Don't keep dead code, just remove it. If you think you might need it back look into version control.

    data = []

    with open(tweet_file, 'r') as f:
        for line in f:
            data.append(json.loads(line))

        #print data[4]['text']

You're finished with the file now, so you should really drop out of the with block.

        for tweet in data:

There isn't really much point in storing the json objects in a list just to process them. Just process them as you get them.
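One way to do that is a small generator, sketched here. iter_tweet_texts is a hypothetical name and the three JSON lines are invented sample data, not real API output.

```python
import io
import json

def iter_tweet_texts(lines):
    """Yield the 'text' field of each JSON line that has one."""
    for line in lines:
        tweet = json.loads(line)
        if 'text' in tweet:
            yield tweet['text']

# In-memory stand-in for the tweet file; the middle line has no 'text' key.
sample = io.StringIO(
    u'{"text": "so sad today"}\n'
    u'{"delete": {"status": {"id": 1}}}\n'
    u'{"text": "happy happy"}\n'
)
texts = list(iter_tweet_texts(sample))
```

The list() call is only there to show the result. In the real loop you would iterate over the generator directly, so only one tweet is held in memory at a time.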

            total = 0
            if 'text' in tweet:
                for k, v in scores.iteritems():

                    #print tweet['text']
                    num_of_aff = len(re.findall(k, tweet['text']))
                    if num_of_aff > 0:
                        #print "Number is: " + str(num_of_aff)
                        #print "Word is: " + k
                        #print "Tweet is: " + tweet['text']
                        total += (v * num_of_aff)
                        #print "Score is: " + str(total)

                        #while count < len(recorded_affin):

Don't leave commented code in there

                        foundRow = findRecord(k, recored_affin)

                        if foundRow != None:

Use is not None rather than != None to check foundRow

                            index = recored_affin.index(foundRow)

That's going to be expensive, it scans through the whole list again.

                            quick_rec = recored_affin[index]

Isn't this just foundRow again?

                            if v > 0:
                                new_value = quick_rec[1] + v
                                new_count = quick_rec[2] + 1
                                old_neg_value = 0
                                old_neg_count = 0
                                recored_affin.append([k, new_value, new_count, old_neg_value, old_neg_count])
                                recored_affin.remove(foundRow)

Expensive, has to scan through again.

                            elif v < 0:

                                old_pos_value = 0
                                old_pos_count = 0
                                new_value = quick_rec[3] + v
                                new_count = quick_rec[4] + 1
                                recored_affin.append([k, old_pos_value, old_pos_count, new_value, new_count])
                                recored_affin.remove(foundRow)

You've got some duplication here, you should move the common logic out of the if blocks.
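A possible shape for that, as a sketch only: add_score is a hypothetical helper that updates the found row in place, so there is no append or remove at all. One difference from the original is that the row keeps its position in the list instead of moving to the end, so the output order may change.

```python
def add_score(record, score):
    """Update a [term, pos_total, pos_count, neg_total, neg_count] row in place."""
    if score == 0:
        return  # the original code ignores zero scores as well
    # Positive totals live at indexes 1-2, negative totals at indexes 3-4.
    offset = 1 if score > 0 else 3
    record[offset] += score
    record[offset + 1] += 1

row = ["sad", 0, 0, -2, 1]   # made-up row: one earlier match with score -2
add_score(row, -2)
```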

                        else:
                            if v > 0:
                                recored_affin.append([k,v,1,0,0])
                            elif v < 0:
                                recored_affin.append([k,0,0,v,1])

                            #print recored_affin

                        ##print foundRow


                ##print total
    average_records(recored_affin)



def lines(fp):
    print str(len(fp.readlines()))

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])

Why do you open these files but never do anything with them?

    hw(sys.argv[1], sys.argv[2])
    #lines(sent_file)
    #lines(tweet_file)

if __name__ == '__main__':
    main()

Your speed issues are probably the result of using a list and constantly searching over the whole list instead of using a dictionary. Make recored_affin a dictionary, and your code should be simpler and faster.
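To make that concrete, here is a minimal sketch of a dictionary-based accumulator. It rests on assumptions that are mine, not yours: the function name accumulate and the sample data are invented, it uses a plain substring test (term in text) where your code uses re.findall, and it keeps a single [total, count] pair per term, since each term has only one score and so only one side of your positive/negative columns is ever used.

```python
from collections import defaultdict

def accumulate(scores, texts):
    """Return {term: [score_total, tweet_count]} for terms found in texts."""
    recorded = defaultdict(lambda: [0, 0])
    for text in texts:
        for term, score in scores.items():
            if term in text:              # substring test, not a regex match
                entry = recorded[term]    # hash lookup; created on first use
                entry[0] += score         # add the score once per matching tweet
                entry[1] += 1
    return recorded

scores = {"happy": 3, "sad": -2}
texts = ["so sad today", "happy happy", "sad but happy"]
recorded = accumulate(scores, texts)
averages = {term: float(total) / count
            for term, (total, count) in recorded.items()}
```

Every update is now a single dictionary lookup, with no scan, append, or remove on a list.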

