I am reading two large CSV files, each with 315,000 rows and 300 columns. I was hoping to read all of this in using Python, but am now running into memory issues at around 50,000 rows. I have around 4 GB of RAM and each CSV file is 1.5 GB. I was going to try Amazon's web services, but if anyone has suggestions on optimization techniques for reading in the files, I'd love to save the money!
Sample data (the first 2 of 314,000 rows) is here: https://drive.google.com/file/d/0B0MhJ7rn5OujR19LLVYyUFF5MVE/edit?usp=sharing
I get the following errors in my Python(x,y) Spyder console:

    for row in getstuff(filename): (line 97)
    for row in getdata("test.csv"): (line 89)
    MemoryError
I have also tried the following, as a comment suggested, and still receive a memory error:

    for row in getdata("train.csv"):
        data.append(row[0::])
    np.array(data)
Code below:
    import csv
    from xlrd import open_workbook
    from xlutils.copy import copy
    import numpy as np
    import time
    from sklearn.ensemble import RandomForestClassifier
    from numpy import savetxt
    from sklearn.feature_extraction import DictVectorizer
    from xlwt import *

    t0 = time.clock()
    data = []
    data1 = []
    count = 0
    print "Initializing..."

    def getstuff(filename):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            for row in datareader:
                if count < 100000:
                    yield row
                    count += 1
                elif count > 100000:
                    return
                else:
                    return

    def getdata(filename):
        for row in getstuff(filename):
            yield row

    for row in getdata("train.csv"):
        np.array(data.append(row[0::]))

    for row in getdata("test.csv"):
        np.array(data1.append(row[0::]))

    target = np.array([x[1] for x in data], dtype=object)
    train = np.array([x[2:] for x in data], dtype=object)
    test = np.array([x[1:] for x in data1], dtype=object)
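For reference, here is a minimal sketch of the accumulate-then-convert idea (building a plain list of rows and calling `np.array` once, outside the loop), using a hypothetical in-memory CSV in place of `train.csv` and Python 3 syntax:

```python
import csv
import io

import numpy as np

# Hypothetical in-memory stand-in for train.csv (column 1 holds the labels).
sample = io.StringIO("id,label,f1,f2\n1,0,0.5,0.7\n2,1,0.1,0.9\n")

def read_rows(csvfile, limit=100000):
    """Yield up to `limit` data rows from an already-open CSV file."""
    datareader = csv.reader(csvfile)
    next(datareader)  # skip the header row
    for count, row in enumerate(datareader):
        if count >= limit:
            return
        yield row

# Build a plain list of rows, then convert to an ndarray ONCE, outside the loop.
data = list(read_rows(sample))
target = np.array([x[1] for x in data], dtype=object)
train = np.array([x[2:] for x in data], dtype=object)
```

Note that `np.array(data.append(...))` in the original code does not do this: `list.append` returns `None`, so `np.array` is called on `None` each iteration and the result is discarded.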
Comments:

numpy.genfromtxt? – Steven Rumbalski Aug 6 '14 at 4:04

With np.array(data.append()), it probably does not convert data to an np.array. Maybe you can try converting data just once, outside the loop. – seb Aug 6 '14 at 4:09