
I am working with quite large files (PyTables) and I am running into a MemoryError when I try to load the data for processing.

I would like some tips on how to avoid this on my 32-bit Python, since I am new to working with pandas and PyTables, and I do not know how to split the data into small pieces.

My other concern is that, even if I do manage to split the data, I do not know how to calculate statistics like the mean and std without having the entire list or array in memory.

This is a sample of the code that I am using now; it works fine with small tables:

import pandas as pd
from tables import openFile  # PyTables

def getPageStats(pathToH5, pages, versions, sheets):

    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")

        dversions = dict((i, None) for i in versions)
        dsheets = dict((i, None) for i in sheets)
        dpages = dict((i, None) for i in pages)


        df = pd.DataFrame([[row['page'],row['index0'], row['value0'] ] for row in tab.where('(firstVersion == 0) & (ok == 1)') if  row['version'] in dversions and row['sheetNum'] in dsheets and row['pages'] in dpages ], columns=['page','index0', 'value0'])        
        df2 = pd.DataFrame([[row['page'],row['index1'], row['value1'] ] for row in tab.where('(firstVersion == 1) & (ok == 1)') if  row['version'] in dversions and row['sheetNum'] in dsheets and row['pages'] in dpages], columns=['page','index1', 'value1'])        

        for i in dpages:


            m10 = df.loc[df['page']==i]['index0'].mean()
            s10 = df.loc[df['page']==i]['index0'].std()

            m20 = df.loc[df['page']==i]['value0'].mean()
            s20 = df.loc[df['page']==i]['value0'].std()

            m11 = df2.loc[df2['page']==i]['index1'].mean()
            s11 = df2.loc[df2['page']==i]['index1'].std()

            m21 = df2.loc[df2['page']==i]['value1'].mean()
            s21 = df2.loc[df2['page']==i]['value1'].std()

            yield (i, m10, s10), (i, m11, s11), (i, m20, s20), (i, m21, s21)

As you can see, I am loading all the necessary data into a pandas DataFrame to process it; for now I only compute the mean and the std.

This is quite fast, but for a PyTables table with 22 million rows I get a MemoryError.

How much memory does a given row take? How much memory is allocated to the process? –  MichaelT Jun 18 '14 at 19:18
    
Thanks for your answer Michael. I do not know exactly how much memory each row takes, but it is just a float32 value in the field of the row that I am interested in. And as for how much memory is allocated, I do not know how to check this; sorry, I am quite inexperienced. –  newPyUser Jun 19 '14 at 6:02

1 Answer

As far as I know, pandas is not the best tool if you cannot store everything in memory.

Additionally, you are creating some extra data that you could try to avoid. I'm talking about the list comprehensions.

For one, they are a bit too big and complex to be list comprehensions, in my opinion.

Secondly, by their nature, for a short period of time you are holding too much data: the complete list plus its copy inside the DataFrame. With the second assignment (df2) you are holding the new DataFrame, its source list, and df, not to mention all the other objects you have already created.

  1. Try to use a generator instead of a list comprehension, either inline by replacing [...] with (...), or a proper generator function, which will also be more readable.

  2. (When solution 1 is not enough) Drop pandas and do the calculations manually. That might require iterating over the data twice, but it will get you the result; a rough sketch of this idea follows below.
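
Just to make suggestion 2 concrete, here is a rough, untested sketch. The column names and filters are the ones from your question; the function and helper names (getPageStatsStreaming, _update, _finish) are made up for illustration. It walks the table once with tab.where and keeps only Welford-style running aggregates per page, so neither the intermediate lists nor the DataFrames are ever built. (For suggestion 1 you would simply replace the outer [...] with (...), although depending on your pandas version the DataFrame constructor may still materialise all the rows internally.)

from collections import defaultdict
from math import sqrt

from tables import openFile  # PyTables 2.x API, as in the question


def _new_stats():
    # running aggregates for Welford's online algorithm: [count, mean, M2]
    return [0, 0.0, 0.0]


def _update(stats, x):
    # fold one value into the running mean/variance
    stats[0] += 1
    delta = x - stats[1]
    stats[1] += delta / stats[0]
    stats[2] += delta * (x - stats[1])


def _finish(stats):
    # sample standard deviation, like DataFrame.std(); 0.0 if fewer than 2 values
    n, mean, m2 = stats
    std = sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return mean, std


def getPageStatsStreaming(pathToH5, pages, versions, sheets):
    # same filters and output as getPageStats, but only running
    # aggregates are kept in memory instead of whole DataFrames
    dversions, dsheets, dpages = set(versions), set(sheets), set(pages)
    acc = defaultdict(_new_stats)  # one accumulator per (page, column)

    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")
        for row in tab.where('(ok == 1)'):
            if (row['version'] not in dversions or
                    row['sheetNum'] not in dsheets or
                    row['pages'] not in dpages):
                continue
            page = row['page']
            if row['firstVersion'] == 0:
                _update(acc[(page, 'index0')], row['index0'])
                _update(acc[(page, 'value0')], row['value0'])
            elif row['firstVersion'] == 1:
                _update(acc[(page, 'index1')], row['index1'])
                _update(acc[(page, 'value1')], row['value1'])

    for i in dpages:
        m10, s10 = _finish(acc[(i, 'index0')])
        m20, s20 = _finish(acc[(i, 'value0')])
        m11, s11 = _finish(acc[(i, 'index1')])
        m21, s21 = _finish(acc[(i, 'value1')])
        yield (i, m10, s10), (i, m11, s11), (i, m20, s20), (i, m21, s21)

You would call it exactly like your current generator; the per-page means and standard deviations come out in the same order, just without the two big DataFrames sitting in memory.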

Thanks, very interesting comments! –  newPyUser Jun 22 '14 at 11:17
