
I am reading two large csv files, each with 315,000 rows and 300 columns. I was hoping to read all of this in using Python, but I now run into memory issues at around 50,000 rows. I have around 4 GB of RAM and each csv file is 1.5 GB. I was going to try Amazon's web services, but if anyone has suggestions on optimization techniques for reading in the files, I'd love to save the money!

Sample data (the first 2 of 314,000 rows) here: https://drive.google.com/file/d/0B0MhJ7rn5OujR19LLVYyUFF5MVE/edit?usp=sharing

I get the following errors in my Python(x,y) Spyder console:

for row in getstuff(filename): (line 97)
for row in getdata("test.csv"): (line 89)
MemoryError

I have also tried the following, as a comment suggested, and still receive a memory error:

for row in getdata("train.csv"):
    data.append(row[0::])

np.array(data)
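One way to keep memory bounded regardless of file size is to process the file in fixed-size chunks rather than accumulating every row, assuming the rows can be handled a batch at a time. A sketch (the chunk size is illustrative, not from the question):

```python
import csv
from itertools import islice

def iter_chunks(path, chunk_size=10000):
    # Yield lists of rows, chunk_size at a time, so only one chunk
    # is ever resident in memory.
    with open(path) as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```

Each chunk can then be converted to an array, used, and discarded before the next one is read.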

Code below:

import csv
from xlrd import open_workbook 
from xlutils.copy import copy 
import numpy as np
import time
from sklearn.ensemble import RandomForestClassifier
from numpy import savetxt
from sklearn.feature_extraction import DictVectorizer
from xlwt import *


t0=time.clock()
data=[]
data1=[]

count=0
print "Initializing..."

def getstuff(filename):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        for row in datareader:
            if count < 100000:
                yield row
                count += 1
            else:
                return

def getdata(filename):
    for row in getstuff(filename):
        yield row


for row in getdata("train.csv"):
   np.array(data.append(row[0::]))


for row in getdata("test.csv"): 
   np.array(data1.append(row[0::]))


target = np.array([x[1] for x in data],dtype=object)
train = np.array([x[2:] for x in data],dtype=object)    
test = np.array([x[1:] for x in data1],dtype=object)    
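The append loops above keep every field as a Python string object, which costs far more than 4 GB for 315,000 × 300 values (the `np.array(...)` wrapped around `append` also does nothing, since `append` returns `None`). If the fields are numeric, preallocating one compact array and filling it row by row keeps the footprint near 360 MB. A sketch (row/column counts are from the question; `float32` and the helper name are assumptions):

```python
import csv
import numpy as np

def load_numeric_csv(path, n_rows, n_cols, dtype=np.float32):
    # One contiguous float32 array: 315,000 x 300 values is ~360 MB,
    # versus several GB for the equivalent nested lists of strings.
    out = np.empty((n_rows, n_cols), dtype=dtype)
    with open(path) as f:
        for i, row in enumerate(csv.reader(f)):
            out[i] = np.asarray(row, dtype=dtype)
    return out
```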
Any reason you are not reading the csv with numpy.genfromtxt? – Steven Rumbalski Aug 6 '14 at 4:04
By writing np.array(data.append()), it probably does not convert data to an np.array. Maybe you can try converting data just once, outside the loop. – seb Aug 6 '14 at 4:09
Mr Rumbalski, thank you for your response. I tried genfromtxt initially and ran into similar issues but will try once again. – user2476810 Aug 6 '14 at 12:52
seb, I tried to convert to np.array(data) outside of the for loop and received a memory error again – user2476810 Aug 6 '14 at 12:57
Is your data numeric? – Steven Rumbalski Aug 6 '14 at 14:36
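Following up on the comments above: if every column is numeric, `numpy.genfromtxt` with an explicit compact dtype avoids the default 8-byte floats and roughly halves the footprint. A sketch (the column split mirrors the question's `target`/`train` slicing; the function name is illustrative):

```python
import numpy as np

def load_split(path):
    # Explicit dtype stops genfromtxt from defaulting to float64,
    # halving memory for an all-numeric file.
    data = np.genfromtxt(path, delimiter=",", dtype=np.float32)
    target = data[:, 1]   # mirrors target = [x[1] for x in data]
    train = data[:, 2:]   # mirrors train = [x[2:] for x in data]
    return target, train
```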
