Split CSV by Repeated cells python

Question

Any time that a row ID (oddly placed in column 8, i.e. row[7]) is repeated after the first instance, I want to write those rows into a second file. The code I'm using is extremely slow -- it's a 40-column CSV with about a million rows. This is what I have:

import csv

def in_out_gorbsplit(inf, outf1, outf2):
    outf1 = csv.writer(open(outf1, 'wb'), delimiter=',', lineterminator='\n')
    outf2 = csv.writer(open(outf2, 'wb'), delimiter=',', lineterminator='\n')
    inf1 = csv.reader(open(inf, 'rbU'), delimiter=',')
    inf1.next()
    checklist = []
    for row in inf1:
        id_num = str(row[7])
        if id_num not in checklist:
            outf1.writerow(row)
            checklist.append(id_num)
        else:
            outf2.writerow(row)

janos · Accepted Answer · 2014-11-30 10:47:30Z

Since checklist is a list, each "not in" test has to scan the elements one by one until it finds a match or reaches the end. In other words, a single lookup has a complexity of \$O(n)\$, and with about a million rows the loop as a whole approaches \$O(n^2)\$. Use a set() instead: a membership test on a set takes \$O(1)\$ time on average, which makes the loop much faster.
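
To make the difference concrete, here is a small timing sketch. It is not taken from the original post: the ID values, the collection size, and the names ids_list, ids_set, and missing are made up for illustration, and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical data: 100,000 distinct string IDs, held once as a list and once as a set.
n = 100000
ids_list = [str(i) for i in range(n)]
ids_set = set(ids_list)

# A value that is absent is the worst case for the list: every element gets compared.
missing = "not-an-id"

list_time = timeit.timeit(lambda: missing in ids_list, number=100)
set_time = timeit.timeit(lambda: missing in ids_set, number=100)

print("list: %.4f s for 100 lookups" % list_time)
print("set:  %.6f s for 100 lookups" % set_time)
```

The list lookup time grows with the number of stored IDs, while the set lookup time stays roughly constant.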

Also don't forget to close open file handles.
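
Putting both suggestions together, a minimal sketch could look like the following. It assumes Python 3 (the question's code is Python 2, where the files would be opened in 'rb'/'wb' mode and the header skipped with inf1.next()). The function name split_repeated_ids and the id_col parameter are hypothetical, chosen for this example only.

```python
import csv


def split_repeated_ids(in_path, first_path, repeat_path, id_col=7):
    """Write the first row seen for each ID to first_path and later repeats to repeat_path."""
    seen = set()  # set membership tests are O(1) on average
    # The with statement closes all three files, even if an error occurs mid-loop.
    with open(in_path, newline="") as inf, \
            open(first_path, "w", newline="") as out1, \
            open(repeat_path, "w", newline="") as out2:
        reader = csv.reader(inf)
        firsts = csv.writer(out1, lineterminator="\n")
        repeats = csv.writer(out2, lineterminator="\n")
        next(reader, None)  # skip the header row, as the question's code does
        for row in reader:
            id_num = row[id_col]
            if id_num in seen:
                repeats.writerow(row)
            else:
                seen.add(id_num)
                firsts.writerow(row)
```

It would be called like the original, for example split_repeated_ids('in.csv', 'first.csv', 'repeats.csv'), where the file names are placeholders.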

edited Nov 30 '14 at 10:47

answered Nov 30 '14 at 8:44

janos♦

This made the slight difference between unoptimized code taking 45 minutes and optimized code taking... 5.5 seconds. I knew something was off! – Xodarap777 Nov 30 '14 at 10:13

asked	9 months ago
viewed	40 times
active	9 months ago

Tagged: python, python-2.7, csv
