I have a large pandas dataframe (size = 3 GB):
import pandas as pd
x = pd.read_table('big_table.txt', sep='\t', header=0, index_col=0)
Because I'm working under memory constraints, I subset the dataframe:
rows = calculate_rows() # a function that calculates what rows I need
cols = calculate_cols() # a function that calculates what cols I need
x = x.loc[rows, cols]
The functions that calculate the rows and columns are not important, but they are DEFINITELY a smaller subset of the original rows and columns. However, when I do this operation, memory usage increases by a lot! The original goal was to shrink the memory footprint to less than 3GB, but instead, memory usage goes well over 6GB.
I'm guessing this is because the slicing creates a new copy of the data in memory while the original dataframe is still referenced, so it never gets cleaned up. There may be other things going on as well. So my question is: how do I subset a large dataframe and reclaim the space? I can't find a function that selects rows/cols in place.
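To make the behavior concrete, here is a toy version of what I'm seeing (a small frame standing in for the 3 GB one, and placeholder values standing in for what `calculate_rows()`/`calculate_cols()` return; using `.loc` since `.ix` is deprecated):

```python
import pandas as pd

# toy stand-in for the 3 GB frame
x = pd.DataFrame({'a': range(100_000), 'b': range(100_000), 'c': range(100_000)})

rows = range(0, 100_000, 2)   # stand-in for calculate_rows()
cols = ['a', 'b']             # stand-in for calculate_cols()

# .loc builds a brand-new DataFrame; nothing about the slicing itself
# releases the original, so for a moment both frames are alive at once
subset = x.loc[rows, cols]

print(x.memory_usage(deep=True).sum())       # original, still fully allocated
print(subset.memory_usage(deep=True).sum())  # plus the new copy
```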
I have read a lot of Stack Overflow, but can't find much on this topic. It could be I'm not using the right keywords, so if you have suggestions, that could also help. Thanks!
You can force garbage collection with the `gc` module, so `import gc`, then call `del df` and then `gc.collect()`. However, in your case you should consider using `h5py` for larger than memory data. – EdChum Oct 30 '13 at 8:08
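The comment's recipe would look roughly like this (a sketch, with a small `df` standing in for the big frame; the `.copy()` is my addition so the subset doesn't hold a reference back to the original's data):

```python
import gc

import pandas as pd

df = pd.DataFrame({'a': range(100_000), 'b': range(100_000)})  # stand-in for the 3 GB frame

# take the subset as an independent copy so it does not share
# memory with (or keep alive) the original frame
subset = df.loc[range(0, 100_000, 2), ['a']].copy()

# drop the last reference to the big frame and force collection
del df
gc.collect()

df = subset  # keep working under the old name
```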