
I often convert pandas.DataFrame objects to lists of formatted row strings so I can print the rows into, e.g., a tkinter.Listbox. To do this, I have been using pandas.DataFrame.to_string. The method has a lot of nice functionality built in, but when the number of dataframe rows or columns gets relatively large, to_string becomes slow.

Below I implement a pandas.DataFrame subclass with a few added methods that return formatted row lines. I am looking to improve the get_lines_fast_struct method.

import pandas
np = pandas.np


class DataFrame2(pandas.DataFrame):
    def __init__( self, *args, **kwargs ):
        pandas.DataFrame.__init__(self, *args, **kwargs)

    def get_lines_standard(self):
        """standard way to convert pandas dataframe
            to lines with formatted column spacing"""
        lines = self.to_string(index=False).split('\n')
        return lines

    def get_lines_fast_unstruct(self):
        """ lighter version of pandas.DataFrame.to_string()
            with no special spacing format"""
        df_recs    = self.to_records(index=False)
        col_titles = [' '.join(list(self))]
        col_data   = map(lambda rec:' '.join( map(str,rec) ), 
                         df_recs.tolist())
        lines = col_titles + col_data
        return lines

    def get_lines_fast_struct(self,col_space=1):
        """ lighter version of pandas.DataFrame.to_string()
            with special spacing format"""
        df_recs    = self.to_records(index=False) # convert dataframe to array of records
        str_data   = map(lambda rec: map(str,rec), df_recs ) # map each element to string
        self.space = map(lambda x:len(max(x,key=len))+col_space,  # returns the max string length in each column as a list
                         zip(*str_data)) 

        col_titles = [self._format_line(list(self))]
        col_data   = [self._format_line(row ) for row in str_data ]

        lines = col_titles + col_data
        return lines

    def _format_line(self, row_vals):
        """row_vals: list of strings.
           Adds variable amount of white space to each
           list entry and returns a single string"""
        line_val_gen = ( ('{0: >%d}'%self.space[i]).format(entry) for i,entry in enumerate(row_vals) )  # takes dataframe row entries and adds white spaces based on a format
        line = ''.join(line_val_gen)
        return line

#SOME TEST DATA
df = DataFrame2({'A':np.random.randint(0,1000,1000), 
                 'B':np.random.random(1000), 
                 'C':[random.choice(['EYE', '<3', 'PANDAS', '0.16']) 
                      for _ in range(1000)]})

Method outputs

df.get_lines_standard()[:5] # first five rows in dataframe
#[u'   A         B       C',
# u' 504  0.924385      <3',
# u' 388  0.285854    0.16',
# u' 984  0.254156    0.16',
# u' 446  0.472621  PANDAS']

df.get_lines_fast_struct()[:5] 
#['   A                 B      C',
# ' 504      0.9243853594     <3',
# ' 388    0.285854082778   0.16',
# ' 984    0.254155910401   0.16',
# ' 446    0.472621088021 PANDAS']

df.get_lines_fast_unstruct()[:5]
#['A B C',
# '504 0.9243853594 <3',
# '388 0.285854082778 0.16',
# '984 0.254155910401 0.16',
# '446 0.472621088021 PANDAS']

Timing results

In [262]: %timeit df.get_lines_standard()
10 loops, best of 3: 70.3 ms per loop

In [263]: %timeit df.get_lines_fast_struct()
100 loops, best of 3: 15.4 ms per loop

In [264]: %timeit df.get_lines_fast_unstruct()
100 loops, best of 3: 2.3 ms per loop

1 Answer

import pandas
np = pandas.np

Here you are using the numpy that pandas happens to import, which can lead to confusion. The conventional way to import pandas and numpy is:

import pandas as pd
import numpy as np

Importing numpy yourself does not load the module twice, because imports are cached. Since pandas has already imported numpy, your own import only costs a lookup in sys.modules, and it makes the code much more readable.
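If you want to see the caching for yourself, here is a minimal sketch. It uses the standard-library json module as a stand-in so it runs without pandas installed; that substitution is my assumption for illustration, and the same mechanism applies to numpy once pandas has been imported.

```python
import sys

import json                       # first import: the module is loaded and cached

cached = sys.modules["json"]      # the cache entry created by the import above

import json as json_again         # second import: only a lookup in sys.modules

# Both names refer to the very same module object, so nothing was loaded twice.
same_object = json_again is cached
print(same_object)
```

After `import pandas`, checking `"numpy" in sys.modules` should likewise show that numpy is already cached.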

At the end you use random.choice() but you never imported random.
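A sketch of how the test data setup might look with all three imports in place. I use a plain pd.DataFrame here instead of your DataFrame2 so the snippet stands alone, and N_ROWS is a name I made up for illustration.

```python
import random

import numpy as np
import pandas as pd

N_ROWS = 1000  # hypothetical constant, not in the original code

# Same three columns as the question's test data.
df = pd.DataFrame({
    "A": np.random.randint(0, 1000, N_ROWS),
    "B": np.random.random(N_ROWS),
    "C": [random.choice(["EYE", "<3", "PANDAS", "0.16"])
          for _ in range(N_ROWS)],
})
```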


In get_lines_standard() you first convert the complete DataFrame to a string and then split it on the line breaks. In your example you then slice off the top 5 lines to display. The way your code works, there is no way to show only the top 5 rows without rendering the complete DataFrame, and this applies to all 3 methods. To demonstrate the difference between slicing after and before rendering (using the random data generated at the end of your code, but with 10k rows instead of 1k):

# both calls show the first rows of the frame:

%timeit df.to_string(index=False).split('\n')[:5]
1 loops, best of 3: 1.51 s per loop

%timeit df[:5].to_string(index=False).split('\n')
100 loops, best of 3: 3.38 ms per loop
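One way to build this into your methods is an optional row limit that is applied before rendering. This is only a sketch: get_lines and its n_rows parameter are hypothetical names of mine, written as a free function rather than a method of your class.

```python
import pandas as pd


def get_lines(df, n_rows=None):
    """Return formatted row strings, rendering only the first n_rows rows.

    n_rows=None renders the whole frame, as get_lines_standard does.
    """
    # Slice first, so to_string only has to format the rows we keep.
    subset = df if n_rows is None else df.iloc[:n_rows]
    return subset.to_string(index=False).split("\n")


df = pd.DataFrame({"A": range(100), "C": ["x"] * 100})
lines = get_lines(df, n_rows=5)  # header line plus five data rows
```

Keep in mind that to_string sizes the columns from the rows it actually renders, so the spacing of a sliced render may differ from the full render if later rows contain wider values.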

PS: I don't want to pep8ify you, but please don't line up your equal signs.
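For illustration, here is roughly how get_lines_fast_unstruct could look with single spaces around the equal signs. It is a sketch written as a free function rather than your class method, and I swapped the nested map/lambda for list comprehensions so it also runs on Python 3; both changes are mine.

```python
import pandas as pd


def get_lines_fast_unstruct(df):
    """Return space-joined header and row strings, with no column alignment."""
    records = df.to_records(index=False)
    col_titles = [" ".join(str(name) for name in df.columns)]
    col_data = [" ".join(map(str, rec)) for rec in records.tolist()]
    return col_titles + col_data


example = pd.DataFrame({"A": [1, 2], "C": ["x", "y"]})
lines = get_lines_fast_unstruct(example)
print(lines)
```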
