I am looking for the fastest way to parse an 80GB csv file with 300 columns in Python.
The CSV file does not contain quoted commas, e.g. `a,"blah,blah,blah",c` — every comma is a real field separator.
I have tried Python's built-in csv module, which gives me ~50MB/s:

```python
import csv

with open(file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        pass
```
With a plain `line.split(',')` I get ~55MB/s:

```python
with open(file_name) as csvfile:
    for line in csvfile:
        row = line.split(',')
```
Without any parsing, just iterating over the lines, I get ~400MB/s:

```python
with open(file_name) as csvfile:
    for line in csvfile:
        pass
```
Profiling with cProfile shows that most of the time is spent in `split()`, `csv.reader()`, or `re.split()`, which I have also tried.
Can I make this faster? Since a plain read from disk runs at ~400MB/s, I would hope to reach ~200MB/s while parsing the CSV, if that is possible.