I am reading two large CSV files, each with 315,000 rows and 300 columns. I was hoping to read all of this in using Python, but am now running into memory issues at around 50,000 rows. I have around 4 GB of RAM and each CSV file is 1.5 GB. I was going to try Amazon's web services, but if anyone has suggestions on optimization techniques for reading in the files, I'd love to save the money!
Sample data (the first 2 of 314,000 rows) is here: https://drive.google.com/file/d/0B0MhJ7rn5OujR19LLVYyUFF5MVE/edit?usp=sharing
I get the following errors in my Python(x,y) Spyder console:

    for row in getstuff(filename): (line 97)
    for row in getdata("test.csv"): (line 89)
    MemoryError
I have also tried the following, as a comment suggested, and still receive a memory error:

    for row in getdata("train.csv"):
        data.append(row[0::])
    np.array(data)
Code below:
    import csv
    from xlrd import open_workbook
    from xlutils.copy import copy
    import numpy as np
    import time
    from sklearn.ensemble import RandomForestClassifier
    from numpy import savetxt
    from sklearn.feature_extraction import DictVectorizer
    from xlwt import *

    t0 = time.clock()
    data = []
    data1 = []
    count = 0
    print "Initializing..."

    def getstuff(filename):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            for row in datareader:
                if count < 100000:
                    yield row
                    count += 1
                elif count > 100000:
                    return
                else:
                    return

    def getdata(filename):
        for row in getstuff(filename):
            yield row

    for row in getdata("train.csv"):
        np.array(data.append(row[0::]))

    for row in getdata("test.csv"):
        np.array(data1.append(row[0::]))

    target = np.array([x[1] for x in data], dtype=object)
    train = np.array([x[2:] for x in data], dtype=object)
    test = np.array([x[1:] for x in data1], dtype=object)
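For reference, here is a minimal sketch of the accumulate-then-convert idea (building a plain list of rows and calling `np.array` once, outside the loop), using a hypothetical in-memory CSV in place of `train.csv` and Python 3 syntax:

```python
import csv
import io

import numpy as np

# Hypothetical in-memory stand-in for train.csv (column 1 holds the labels).
sample = io.StringIO("id,label,f1,f2\n1,0,0.5,0.7\n2,1,0.1,0.9\n")

def read_rows(csvfile, limit=100000):
    """Yield up to `limit` data rows from an already-open CSV file."""
    datareader = csv.reader(csvfile)
    next(datareader)  # skip the header row
    for count, row in enumerate(datareader):
        if count >= limit:
            return
        yield row

# Build a plain list of rows, then convert to an ndarray ONCE, outside the loop.
data = list(read_rows(sample))
target = np.array([x[1] for x in data], dtype=object)
train = np.array([x[2:] for x in data], dtype=object)
```

Note that `np.array(data.append(...))` in the original code does not do this: `list.append` returns `None`, so `np.array` is called on `None` each iteration and the result is discarded.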
Comments:

numpy.genfromtxt? – Steven Rumbalski Aug 6 '14 at 4:04

With np.array(data.append()), it probably does not convert data to an np.array. Maybe you can try converting data just once, outside the loop. – seb Aug 6 '14 at 4:09