A custom Pandas dataframe to_string method

Question

Oftentimes I find myself converting pandas.DataFrame objects to lists of formatted row strings, so I can print the rows into, e.g. a tkinter.Listbox. To do this, I have been utilizing pandas.DataFrame.to_string. There is a lot of nice functionality built into the method, but when the number of dataframe rows/columns gets relatively large, to_string starts to tank.

Below I implement a custom pandas.DataFrame class with a few added methods for returning formatted row lines. I am looking to improve upon the get_lines_fast_struct method.

import pandas
np = pandas.np


class DataFrame2(pandas.DataFrame):
    def __init__( self, *args, **kwargs ):
        pandas.DataFrame.__init__(self, *args, **kwargs)

    def get_lines_standard(self):
        """standard way to convert pandas dataframe
            to lines with formatted column spacing"""
        lines = self.to_string(index=False).split('\n')
        return lines

    def get_lines_fast_unstruct(self):
        """ lighter version of pandas.DataFrame.to_string()
            with no special spacing format"""
        df_recs    = self.to_records(index=False)
        col_titles = [' '.join(list(self))]
        col_data   = map(lambda rec:' '.join( map(str,rec) ), 
                         df_recs.tolist())
        lines = col_titles + col_data
        return lines

    def get_lines_fast_struct(self,col_space=1):
        """ lighter version of pandas.DataFrame.to_string()
            with special spacing format"""
        df_recs    = self.to_records(index=False) # convert dataframe to array of records
        str_data   = map(lambda rec: map(str,rec), df_recs ) # map each element to string
        self.space = map(lambda x:len(max(x,key=len))+col_space,  # returns the max string length in each column as a list
                         zip(*str_data)) 

        col_titles = [self._format_line(list(self))]
        col_data   = [self._format_line(row ) for row in str_data ]

        lines = col_titles + col_data
        return lines

    def _format_line(self, row_vals):
        """row_vals: list of strings.
           Adds variable amount of white space to each
           list entry and returns a single string"""
        line_val_gen = ( ('{0: >%d}'%self.space[i]).format(entry) for i,entry in enumerate(row_vals) )  # takes dataframe row entries and adds white spaces based on a format
        line = ''.join(line_val_gen)
        return line

#SOME TEST DATA
df = DataFrame2({'A':np.random.randint(0,1000,1000), 
                 'B':np.random.random(1000), 
                 'C':[random.choice(['EYE', '<3', 'PANDAS', '0.16']) 
                      for _ in range(1000)]})

Method outputs

df.get_lines_standard()[:5] # first five rows in dataframe
#[u'   A         B       C',
# u' 504  0.924385      <3',
# u' 388  0.285854    0.16',
# u' 984  0.254156    0.16',
# u' 446  0.472621  PANDAS']

df.get_lines_fast_struct()[:5] 
#['   A                 B      C',
# ' 504      0.9243853594     <3',
# ' 388    0.285854082778   0.16',
# ' 984    0.254155910401   0.16',
# ' 446    0.472621088021 PANDAS']

df.get_lines_fast_unstruct()[:5]
#['A B C',
# '504 0.9243853594 <3',
# '388 0.285854082778 0.16',
# '984 0.254155910401 0.16',
# '446 0.472621088021 PANDAS']

Timing results

In [262]: %timeit df.get_lines_standard()
10 loops, best of 3: 70.3 ms per loop

In [263]: %timeit df.get_lines_fast_struct()
100 loops, best of 3: 15.4 ms per loop

In [264]: %timeit df.get_lines_fast_unstruct()
100 loops, best of 3: 2.3 ms per loop

ferada · Answer 1 · 2015-08-09 14:46:35Z

import pandas
np = pandas.np

What you are doing here is using the numpy that pandas imports, which can lead to confusion. There is an agreed standard to import pandas and numpy:

import pandas as pd
import numpy as np

Importing numpy yourself does not load the module twice, because imports are cached. numpy is already imported when pandas is imported, so your own import costs only a lookup in sys.modules, and you gain a lot of readability.
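To illustrate the caching behaviour, here is a minimal sketch. It uses the standard-library json module as a stand-in so it runs without numpy installed; the same mechanism applies to numpy once pandas has imported it. The names json_again, same_object and cached are made up for this example.

```python
import sys

# First import: the module is loaded and stored in sys.modules.
import json

# Importing it again (here under a second, made-up name) is only a
# cache lookup; it binds the same module object instead of reloading it.
import json as json_again

same_object = json_again is json
cached = sys.modules["json"] is json
print(same_object and cached)  # True
```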

At the end you use random.choice() but you never imported random.

In get_lines_standard() you first convert the complete DataFrame to a string and then split it on the line breaks. In your example you then slice off the top 5 lines to display. The way your code works, there is no way to show only the top 5 rows without rendering the complete DataFrame, and this applies to all three methods. To demonstrate the difference between slicing before and after rendering (using the random data generated at the end of your code, but with 10k rows instead of 1k):

# both calls show the top rows (the first returns the header plus 4 rows, the second the header plus 5):

%timeit df.to_string(index=False).split('\n')[:5]
1 loops, best of 3: 1.51 s per loop

%timeit df[:5].to_string(index=False).split('\n')
100 loops, best of 3: 3.38 ms per loop
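One way to apply this is to let the caller say how many rows are needed and to slice before rendering. The sketch below is only an illustration: head_lines is a hypothetical helper, not part of pandas or of your class, and the test data is made up.

```python
import numpy as np
import pandas as pd


def head_lines(df, n=5):
    """Return the header line plus the first n rows as formatted strings.

    Hypothetical helper: it slices first, so only n rows are rendered.
    """
    return df.iloc[:n].to_string(index=False).split('\n')


# Made-up test data with 10k rows.
df = pd.DataFrame({'A': np.arange(10000),
                   'B': np.random.random(10000)})

lines = head_lines(df, 5)
print(len(lines))  # 6: the header plus five rows
```

Note that the column widths are then computed from the slice only, so the spacing can differ from what a full render would produce.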

PS: I don't want to pep8ify you, but please don't line up your equal signs.

