
I am working with quite large files (PyTables) and I am running into a MemoryError when I try to load the data for processing.

I would like some tips on how to avoid this on my 32-bit Python, since I am new to working with pandas and PyTables, and I do not know how to split the data into small pieces.

My other concern is that, even if I do manage to split the data, I do not know how to calculate statistics like the mean and std without having the entire list or array in memory.

This is a sample of the code that I am using now; it works fine with small tables:

import pandas as pd
from tables import openFile  # PyTables

def getPageStats(pathToH5, pages, versions, sheets):

    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")

        dversions = dict((i, None) for i in versions)
        dsheets = dict((i, None) for i in sheets)
        dpages = dict((i, None) for i in pages)


        df = pd.DataFrame([[row['page'],row['index0'], row['value0'] ] for row in tab.where('(firstVersion == 0) & (ok == 1)') if  row['version'] in dversions and row['sheetNum'] in dsheets and row['pages'] in dpages ], columns=['page','index0', 'value0'])        
        df2 = pd.DataFrame([[row['page'],row['index1'], row['value1'] ] for row in tab.where('(firstVersion == 1) & (ok == 1)') if  row['version'] in dversions and row['sheetNum'] in dsheets and row['pages'] in dpages], columns=['page','index1', 'value1'])        

        for i in dpages:


            m10 = df.loc[df['page']==i]['index0'].mean()
            s10 = df.loc[df['page']==i]['index0'].std()

            m20 = df.loc[df['page']==i]['value0'].mean()
            s20 = df.loc[df['page']==i]['value0'].std()

            m11 = df2.loc[df2['page']==i]['index1'].mean()
            s11 = df2.loc[df2['page']==i]['index1'].std()

            m21 = df2.loc[df2['page']==i]['value1'].mean()
            s21 = df2.loc[df2['page']==i]['value1'].std()

            yield (i, m10, s10), (i, m11, s11), (i, m20, s20), (i, m21, s21)

As you can see, I am loading all the necessary data into a pandas DataFrame to process it; for now I only compute the mean and the std.

This is quite fast, but for a PyTables table with 22 million rows I get a MemoryError.

How much memory does a given row take? How much memory is allocated to the process? –  MichaelT Jun 18 '14 at 19:18
    
Thanks for your answer Michael. I do not know exactly how much memory each row takes, but it is just a float32 value in the field of the row that I am interested in. And as for how much memory is allocated, I do not know how to check this; sorry, I am quite inexperienced. –  newPyUser Jun 19 '14 at 6:02

1 Answer

As far as I know, pandas is not the best tool if you cannot store everything in memory.

Additionally, you are creating some extra data that you could try to avoid. I'm talking about the list comprehensions.

For one, they are a bit too big and complex to be list comprehensions, in my opinion.

Secondly, by their nature, for a short period of time you are holding too much data: the complete list plus its copy inside the DataFrame. With the second assignment (df2) you are holding the new DataFrame, its source list, and df, not to mention all the other objects you have already created.

  1. Try to use a generator instead of a list comprehension, either inline by replacing [...] with (...), or a proper generator function, which will also be more readable.

  2. (When solution 1 is not enough) Drop pandas and do the calculations manually. That might require iterating over the data twice, but it will get you the result; a rough sketch of this idea follows below.
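
Just to make suggestion 2 concrete, here is a rough, untested sketch. The column names and filters are the ones from your question; the function and helper names (getPageStatsStreaming, _update, _finish) are made up for illustration. It walks the table once with tab.where and keeps only Welford-style running aggregates per page, so neither the intermediate lists nor the DataFrames are ever built. (For suggestion 1 you would simply replace the outer [...] with (...), although depending on your pandas version the DataFrame constructor may still materialise all the rows internally.)

from collections import defaultdict
from math import sqrt

from tables import openFile  # PyTables 2.x API, as in the question


def _new_stats():
    # running aggregates for Welford's online algorithm: [count, mean, M2]
    return [0, 0.0, 0.0]


def _update(stats, x):
    # fold one value into the running mean/variance
    stats[0] += 1
    delta = x - stats[1]
    stats[1] += delta / stats[0]
    stats[2] += delta * (x - stats[1])


def _finish(stats):
    # sample standard deviation, like DataFrame.std(); 0.0 if fewer than 2 values
    n, mean, m2 = stats
    std = sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, std


def getPageStatsStreaming(pathToH5, pages, versions, sheets):
    # same filters and output as getPageStats, but only running
    # aggregates are kept in memory instead of whole DataFrames
    dversions, dsheets, dpages = set(versions), set(sheets), set(pages)
    acc = defaultdict(_new_stats)  # one accumulator per (page, column)

    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")
        for row in tab.where('(ok == 1)'):
            if (row['version'] not in dversions or
                    row['sheetNum'] not in dsheets or
                    row['pages'] not in dpages):
                continue
            page = row['page']
            if row['firstVersion'] == 0:
                _update(acc[(page, 'index0')], row['index0'])
                _update(acc[(page, 'value0')], row['value0'])
            elif row['firstVersion'] == 1:
                _update(acc[(page, 'index1')], row['index1'])
                _update(acc[(page, 'value1')], row['value1'])

    for i in dpages:
        m10, s10 = _finish(acc[(i, 'index0')])
        m20, s20 = _finish(acc[(i, 'value0')])
        m11, s11 = _finish(acc[(i, 'index1')])
        m21, s21 = _finish(acc[(i, 'value1')])
        yield (i, m10, s10), (i, m11, s11), (i, m20, s20), (i, m21, s21)

You would call it exactly like your current generator; the per-page means and standard deviations come out in the same order, just without the two big DataFrames sitting in memory.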

Thanks, very interesting comments! –  newPyUser Jun 22 '14 at 11:17
