
Many times in the past I have found myself needing to generate fake DataFrames in pandas. Just for fun, I decided to write a function that takes a few inputs and generates that DataFrame for me. Here is the code:

import random
import pandas as pd
from faker import Faker

def generate_dummy_dataframe(int_cols, float_cols, bool_cols, string_cols, datetime_cols, rows):
  """
  Generates a dummy dataframe with the specified number of columns and rows for each data type.

  Parameters
  ----------
  int_cols : int
      The number of integer columns to create.
  float_cols : int
      The number of float columns to create.
  bool_cols : int
      The number of boolean columns to create.
  string_cols : int
      The number of string columns to create.
  datetime_cols : int
      The number of datetime columns to create.
  rows : int
      The number of rows to create.

  Returns
  -------
  df : pandas.DataFrame
      A dataframe with the specified number of columns and rows for each data type, filled with sample data.
      
  Raises
  ------
  ValueError
      If any of the inputs is not an integer greater than 0.
  """
  # Check that input is valid
  if not all(isinstance(i, int) and i > 0 for i in [int_cols, float_cols, bool_cols, string_cols, datetime_cols, rows]):
    raise ValueError("All inputs must be integers greater than 0.")

  # Create empty dataframe
  df = pd.DataFrame()

  # Create instance of Faker
  fake = Faker()

  # Add integer columns
  for i in range(int_cols):
    df['int_col_{}'.format(i)] = [random.randint(0, 100) for _ in range(rows)]
  
  # Add float columns
  for i in range(float_cols):
    df['float_col_{}'.format(i)] = [random.uniform(0, 100) for _ in range(rows)]
  
  # Add boolean columns
  for i in range(bool_cols):
    df['bool_col_{}'.format(i)] = [random.choice([1, 0]) for _ in range(rows)]
  
  # Add string columns
  for i in range(string_cols):
    df['string_col_{}'.format(i)] = [fake.word() for _ in range(rows)]
  
  # Add datetime columns
  for i in range(datetime_cols):
    df['datetime_col_{}'.format(i)] = [fake.date_time() for _ in range(rows)]
  
  return df

Then you can call the function like this:

 generate_dummy_dataframe(2, 2, 2, 2, 2, 10)

And generate the output as this:

   int_col_0  int_col_1  float_col_0  float_col_1  bool_col_0  bool_col_1  \
0         26         13    51.154902    74.562551           1           1   
1         84          6    94.790006    22.036552           1           1   
2         90         23    53.832429    29.791543           0           0   
3         64         94     4.101628    18.442224           0           0   
4         85         21    17.637843    51.384612           1           0   
5         16         62    43.132250    99.989500           0           0   
6          7         61    29.019135    90.812649           0           0   
7          7         44    32.686915    80.988226           0           1   
8         75         41    89.628566     5.697429           0           0   
9         95         32    46.610747    19.376951           0           1   

  string_col_0 string_col_1      datetime_col_0      datetime_col_1  
0         fish      network 2012-12-30 11:15:50 2002-09-14 08:39:36  
1     industry         rich 2004-09-21 13:26:07 1972-01-18 22:55:46  
2        coach         feel 2018-12-08 03:25:36 1998-06-19 18:18:27  
3        image        price 2017-07-16 19:37:53 2014-02-17 03:10:18  
4     anything        place 2015-01-19 20:45:41 2003-06-08 06:37:41  
5      century     possible 2004-06-02 02:35:25 1992-03-29 06:26:31  
6        other       expert 1985-04-14 16:27:21 2008-09-30 11:43:44  
7        sound     approach 1975-07-05 03:48:37 1978-03-20 00:08:46  
8  information         very 1989-10-15 15:52:22 2001-10-17 14:38:37  
9           us         more 1990-09-26 09:19:54 1975-01-11 12:07:19  

Areas where I could improve this:

Performance

It's clear that the runtime grows roughly linearly with the number of rows requested:

import time

def test_execution_time(rows):
  start = time.time()
  df = generate_dummy_dataframe(5, 5, 5, 5, 5, rows)
  end = time.time()
  print("Number of rows: {}".format(rows))
  print("Execution time: {} seconds".format(end - start))

# Test execution time for different number of rows
test_execution_time(100)
test_execution_time(1000)
test_execution_time(10000)
test_execution_time(100000)
test_execution_time(1000000)

Number of rows: 100
Execution time: 0.05393719673156738 seconds
Number of rows: 1000
Execution time: 0.12861156463623047 seconds
Number of rows: 10000
Execution time: 1.0118906497955322 seconds
Number of rows: 100000
Execution time: 9.85063910484314 seconds
Number of rows: 1000000
Execution time: 98.54737830162048 seconds

Fewer libraries

Get rid of faker. The string columns don't have to contain valid words; any short dummy text will do. The same goes for the datetime columns. I suspect I can do that without faker.
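For instance, a standard-library-only sketch (using just `random`, `string`, and `datetime`; the function names here are my own, not from the script above) might look like this:

```python
import random
import string
from datetime import datetime, timedelta

def random_string(length=8):
    # Any short dummy text will do; real words are not needed.
    return ''.join(random.choices(string.ascii_lowercase, k=length))

def random_datetime(start=datetime(1970, 1, 1), end=datetime(2023, 1, 1)):
    # Uniformly random datetime between start and end.
    seconds = random.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=seconds)
```

These would drop straight into the list comprehensions in place of `fake.word()` and `fake.date_time()`.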

Readability

Not sure here. Is there anything that can be done to make it more Pythonic and better formatted?

  • why can't a column be 0?
    – depperm
    Commented Jan 4, 2023 at 18:23
  • Good point. I missed that when I was thinking about the rows input. I will change it so the columns can be 0 but rows > 0
    – Tasos
    Commented Jan 4, 2023 at 18:46

1 Answer


Re. performance: you're (forgive me) going about it all wrong. Performant code requires that you

  • allocate memory contiguously whenever possible
  • use the vectorisation capabilities of your libraries, in this case Numpy, rather than loops

That requires a rewrite, which I will partially demonstrate. Otherwise:

Use PEP 484 type hints and move the type information out of the docstring.

The docstring is repetitive. If the description and Returns say the same thing, scrap the description and keep the Returns.

Your validation hides the specific problem. Loop through and include the name of the problematic column in your error message.
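For example, a sketch of validation that names the offender (the helper name `validate_counts` is my own):

```python
def validate_counts(**counts):
    # Report exactly which argument is invalid instead of a blanket message.
    for name, value in counts.items():
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be an integer greater than 0, got {value!r}")

validate_counts(int_cols=2, rows=10)  # passes silently
```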

I demonstrate a method to generate datetimes without Faker. Strings are also possible but I leave this as an exercise to the reader.
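For the curious, here is one way that exercise could go (a sketch of mine, not part of the suggested code below): draw lowercase ASCII codes with NumPy and reinterpret each row of bytes as one fixed-width string, so no Python-level loop is needed.

```python
import numpy as np
from numpy.random import default_rng

def random_string_column(rows, length=6, seed=None):
    rng = default_rng(seed)
    # Codes 97..122 are 'a'..'z'; draw a (rows, length) grid of them.
    codes = rng.integers(low=97, high=123, size=(rows, length), dtype=np.uint8)
    # View each row of bytes as one fixed-width bytestring, then decode to str.
    return codes.view(f'S{length}').ravel().astype(str)
```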

The parameters are both too verbose and not generic enough. You can reframe this as a single dictionary (or, if you want, kwargs) of dtypes to column counts. This will be called like

df = generate_dummy_df(
    cols={
        np.uint8: 4,
        np.bool_: 1,
        np.float64: 2,
        'datetime64[s]': 1,
        'datetime64[ms]': 1,
        'datetime64[ns]': 1,
    },
    rows=20,
)

Suggested

On my nothing-special computer, this generates 10,000,000 rows in 1.3 seconds. Mind you it's not apples-to-apples because I don't include strings.

from timeit import timeit
from typing import Optional, Union

import numpy as np
import pandas as pd
from numpy.random import default_rng


def generate_dummy_df(cols: dict[Union[str, np.dtype], int], rows: int, seed: Optional[int] = None) -> pd.DataFrame:
    """
    Parameters
    ----------
    cols
        The number of columns to create for each given datatype.
    rows
        The number of rows to create.
    seed
        Provided to the random generator.

    Returns
    -------
    df
        A dummy dataframe filled with random data.

    Raises
    ------
    ValueError
        If any of the inputs is not an integer greater than 0.
    """

    for dtype, n in cols.items():
        if not isinstance(n, int) or n < 1:
            raise ValueError(f"{dtype} must have a positive column count")

    if not isinstance(rows, int) or rows < 1:
        raise ValueError(f'{rows} is not a valid row count')

    rand = default_rng(seed)

    def make_cols():
        y20_in_ns = 20 * 365.2425 * 24 * 60 * 60 * 1e9
        start = np.datetime64('2000-01-01', 'ns')
        end = start + np.timedelta64(int(y20_in_ns), 'ns')

        for dtype, n in cols.items():
            shape = n, rows
            mode = dict(size=shape, dtype=dtype)

            if np.issubdtype(dtype, np.integer):
                array = rand.integers(**mode, low=0, high=100)
            elif np.issubdtype(dtype, np.floating):
                # uniform() does not support dtype. Also, this will only work with float32 or float64.
                # For something more exotic, you need a separate branch.
                array = rand.random(**mode)*100.
            elif np.issubdtype(dtype, np.bool_):
                array = rand.integers(**mode, low=0, high=1, endpoint=True)
            elif np.issubdtype(dtype, np.datetime64):
                array = rand.integers(
                    size=shape, dtype=np.int64, low=start, high=end
                ).astype('datetime64[ns]').astype(dtype)
            else:
                raise ValueError(f'Type {dtype} is not supported')  # str dtypes have no __name__

            for i, col in enumerate(array):
                if isinstance(dtype, str):
                    type_name = dtype
                else:
                    type_name = dtype.__name__
                yield f'{type_name}_col_{i}', col

    return pd.DataFrame(dict(make_cols()))


def run_test():
    return generate_dummy_df(
        cols={
            np.int64: 5,
            np.float64: 5,
            np.bool_: 5,
            'datetime64[ns]': 5,
        },
        rows=10_000_000,
    )

t = timeit(run_test, number=1)
print(t)
  • Wow, that's a totally new world to me. I need to study the code and understand what you did. But performance-wise, it is day and night :)
    – Tasos
    Commented Jan 5, 2023 at 9:28
