I am looking for the fastest way to parse an 80GB csv file with 300 columns in Python.
The CSV file does not contain quoted commas, e.g. `a,"blah,blah,blah",c` — every comma is a real field separator.
I have tried Python's built-in csv module, which gives me ~50MB/s:

```python
import csv

with open(file_name) as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        pass
```
With a plain `line.split(',')` I get ~55MB/s:

```python
with open(file_name) as csvfile:
    for line in csvfile:
        row = line.split(',')
```
Without any parsing, just iterating over the lines, I get ~400MB/s:

```python
with open(file_name) as csvfile:
    for line in csvfile:
        pass
```
Profiling with cProfile shows that most of the time is spent in `split()`, `csv.reader()`, or `re.split()`, which I have also tried.
Can I make this faster? Since a plain read from disk runs at ~400MB/s, I would hope to reach ~200MB/s while parsing the CSV, if that is possible.