Any time that a row ID (oddly placed in column 8, i.e. row[7]) is repeated after the first instance, I want to write those rows into a second file. The code I'm using is extremely slow on my input, a 40-column CSV with about a million rows. This is what I have:

import csv

def in_out_gorbsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()  # skip the first row
    checklist = []  # IDs seen so far
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            outf1.writerow(row)  # first occurrence of this ID
            checklist.append(id_num)
        else:
            outf2.writerow(row)  # repeated ID

1 Answer


Since checklist is a list, each "not in" test has to scan its elements one by one, so a single lookup has a complexity of \$O(n)\$. With about a million rows, that adds up to roughly quadratic work overall. Use a set() instead: membership tests on a set take \$O(1)\$ time on average, which makes the loop much faster.
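As a minimal sketch of that change, the membership logic could look like the following. The function name `split_first_and_repeats` and the `key_index` parameter are made up for illustration, and the rows are handled in memory rather than read from a file.

```python
def split_first_and_repeats(rows, key_index=7):
    """Split rows into first occurrences and repeats of the value at key_index.

    Hypothetical helper for illustration; not part of the original code.
    """
    seen = set()  # set membership tests are O(1) on average
    firsts, repeats = [], []
    for row in rows:
        key = row[key_index]
        if key in seen:
            repeats.append(row)   # this ID was already seen
        else:
            seen.add(key)         # remember the ID on first sight
            firsts.append(row)
    return firsts, repeats
```

The only structural difference from the original loop is that `checklist.append(...)` on a list becomes `seen.add(...)` on a set.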

Also, don't forget to close the file handles you open; a with statement does this automatically, even if an error occurs.
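One way to put both suggestions together is sketched below. It assumes Python 3 (the question's code uses Python 2 idioms such as `inf1.next()` and `'wb'`), and the function and parameter names are hypothetical. Like the original, it discards the first row and writes no header to either output.

```python
import csv

def split_csv_by_repeated_id(in_path, first_path, repeat_path, key_index=7):
    """Write first occurrences of each ID to first_path, repeats to repeat_path.

    Hypothetical rewrite for illustration, assuming Python 3.
    """
    # The with statement closes all three files when the block exits.
    with open(in_path, newline='') as src, \
         open(first_path, 'w', newline='') as first_f, \
         open(repeat_path, 'w', newline='') as repeat_f:
        reader = csv.reader(src)
        first_out = csv.writer(first_f, lineterminator='\n')
        repeat_out = csv.writer(repeat_f, lineterminator='\n')
        next(reader, None)  # skip the first row, as the original does
        seen = set()        # IDs seen so far
        for row in reader:
            id_num = row[key_index]
            if id_num in seen:
                repeat_out.writerow(row)
            else:
                seen.add(id_num)
                first_out.writerow(row)
```

Passing `newline=''` when opening the files is what the Python 3 `csv` module documentation recommends in place of the binary modes used in Python 2.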

This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off! –  Xodarap777 Nov 30 '14 at 10:13
