I have often found myself needing to generate fake DataFrames in pandas. Just for fun, I decided to write a script that takes a few inputs and generates the DataFrame for me. Here is the code:
import random

import pandas as pd
from faker import Faker


def generate_dummy_dataframe(int_cols, float_cols, bool_cols, string_cols, datetime_cols, rows):
    """
    Generates a dummy dataframe with the specified number of columns and rows for each data type.

    Parameters
    ----------
    int_cols : int
        The number of integer columns to create.
    float_cols : int
        The number of float columns to create.
    bool_cols : int
        The number of boolean columns to create.
    string_cols : int
        The number of string columns to create.
    datetime_cols : int
        The number of datetime columns to create.
    rows : int
        The number of rows to create.

    Returns
    -------
    df : pandas.DataFrame
        A dataframe with the specified number of columns and rows for each data type, filled with sample data.

    Raises
    ------
    ValueError
        If any of the inputs is not an integer greater than 0.
    """
    # Check that input is valid
    if not all(isinstance(i, int) and i > 0 for i in [int_cols, float_cols, bool_cols, string_cols, datetime_cols, rows]):
        raise ValueError("All inputs must be integers greater than 0.")

    # Create empty dataframe
    df = pd.DataFrame()

    # Create instance of Faker
    fake = Faker()

    # Add integer columns
    for i in range(int_cols):
        df['int_col_{}'.format(i)] = [random.randint(0, 100) for _ in range(rows)]

    # Add float columns
    for i in range(float_cols):
        df['float_col_{}'.format(i)] = [random.uniform(0, 100) for _ in range(rows)]

    # Add boolean columns
    for i in range(bool_cols):
        df['bool_col_{}'.format(i)] = [random.choice([1, 0]) for _ in range(rows)]

    # Add string columns
    for i in range(string_cols):
        df['string_col_{}'.format(i)] = [fake.word() for _ in range(rows)]

    # Add datetime columns
    for i in range(datetime_cols):
        df['datetime_col_{}'.format(i)] = [fake.date_time() for _ in range(rows)]

    return df
Then you can call the function like this:
generate_dummy_dataframe(2, 2, 2, 2, 2, 10)
It produces output like this:
   int_col_0  int_col_1  float_col_0  float_col_1  bool_col_0  bool_col_1  \
0         26         13    51.154902    74.562551           1           1
1         84          6    94.790006    22.036552           1           1
2         90         23    53.832429    29.791543           0           0
3         64         94     4.101628    18.442224           0           0
4         85         21    17.637843    51.384612           1           0
5         16         62    43.132250    99.989500           0           0
6          7         61    29.019135    90.812649           0           0
7          7         44    32.686915    80.988226           0           1
8         75         41    89.628566     5.697429           0           0
9         95         32    46.610747    19.376951           0           1

   string_col_0  string_col_1      datetime_col_0      datetime_col_1
0          fish       network 2012-12-30 11:15:50 2002-09-14 08:39:36
1      industry          rich 2004-09-21 13:26:07 1972-01-18 22:55:46
2         coach          feel 2018-12-08 03:25:36 1998-06-19 18:18:27
3         image         price 2017-07-16 19:37:53 2014-02-17 03:10:18
4      anything         place 2015-01-19 20:45:41 2003-06-08 06:37:41
5       century      possible 2004-06-02 02:35:25 1992-03-29 06:26:31
6         other        expert 1985-04-14 16:27:21 2008-09-30 11:43:44
7         sound      approach 1975-07-05 03:48:37 1978-03-20 00:08:46
8   information          very 1989-10-15 15:52:22 2001-10-17 14:38:37
9            us          more 1990-09-26 09:19:54 1975-01-11 12:07:19
Areas where I think this could be improved:
Performance
Execution time grows roughly linearly with the number of rows. With 25 columns, one million rows takes nearly 100 seconds:
import time


def test_execution_time(rows):
    start = time.time()
    df = generate_dummy_dataframe(5, 5, 5, 5, 5, rows)
    end = time.time()
    print("Number of rows: {}".format(rows))
    print("Execution time: {} seconds".format(end - start))


# Test execution time for different number of rows
test_execution_time(100)
test_execution_time(1000)
test_execution_time(10000)
test_execution_time(100000)
test_execution_time(1000000)
Number of rows: 100
Execution time: 0.05393719673156738 seconds
Number of rows: 1000
Execution time: 0.12861156463623047 seconds
Number of rows: 10000
Execution time: 1.0118906497955322 seconds
Number of rows: 100000
Execution time: 9.85063910484314 seconds
Number of rows: 1000000
Execution time: 98.54737830162048 seconds
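One direction that might help here, as a rough sketch only: draw each numeric column in a single vectorized NumPy call instead of a per-row list comprehension, and build the DataFrame once from a dict. The helper name `generate_numeric_columns` and the `seed` parameter are hypothetical and not part of the script above, and I have not benchmarked this against the timings shown.

```python
import numpy as np
import pandas as pd


def generate_numeric_columns(int_cols, float_cols, bool_cols, rows, seed=None):
    """Hypothetical helper: build int, float and 0/1 columns with NumPy."""
    rng = np.random.default_rng(seed)
    data = {}
    for i in range(int_cols):
        # endpoint=True keeps 100 inclusive, like random.randint(0, 100)
        data[f"int_col_{i}"] = rng.integers(0, 100, size=rows, endpoint=True)
    for i in range(float_cols):
        data[f"float_col_{i}"] = rng.uniform(0, 100, size=rows)
    for i in range(bool_cols):
        # 0/1 integers, matching random.choice([1, 0]) in the original
        data[f"bool_col_{i}"] = rng.integers(0, 2, size=rows)
    # Build the frame once instead of inserting columns one at a time
    return pd.DataFrame(data)


df = generate_numeric_columns(2, 2, 2, rows=1000, seed=0)
```

The string and datetime columns are left out of this sketch because they still depend on faker in the original.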
Fewer libraries
Get rid of faker. The string columns don't have to be valid words, just short dummy text. The same goes for the datetime columns. I think I can do both without faker.
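A minimal sketch of what that could look like using only the standard library. The function names, the word-length bounds, and the 1970 to 2023 date range are all assumptions chosen for illustration; they are not meant to reproduce faker's behavior exactly.

```python
import random
import string
from datetime import datetime, timedelta


def random_word(min_len=3, max_len=10):
    """Hypothetical replacement for fake.word(): random lowercase letters."""
    length = random.randint(min_len, max_len)
    return "".join(random.choices(string.ascii_lowercase, k=length))


def random_datetime(start=datetime(1970, 1, 1), end=datetime(2023, 1, 1)):
    """Hypothetical replacement for fake.date_time(): uniform over [start, end]."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=random.randint(0, span_seconds))


words = [random_word() for _ in range(5)]
stamps = [random_datetime() for _ in range(5)]
```

In the original function these would slot in where `fake.word()` and `fake.date_time()` are called, which would remove the faker import entirely.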
Readability
Not sure here. Is there anything that can be done to make it more Pythonic and better formatted?
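One possible restructuring, offered as a sketch rather than a definitive answer: the five near-identical loops could be driven by a mapping from a column prefix to a value generator, with f-strings for the column names. The names `GENERATORS` and `build_frame` are hypothetical, and only three of the five types are shown to keep the example free of faker.

```python
import random

import pandas as pd

# Hypothetical mapping from a column-name prefix to a zero-argument generator
GENERATORS = {
    "int": lambda: random.randint(0, 100),
    "float": lambda: random.uniform(0, 100),
    "bool": lambda: random.choice([1, 0]),
}


def build_frame(counts, rows, generators=GENERATORS):
    """Build every column in one dict comprehension, then create the frame once."""
    data = {
        f"{kind}_col_{i}": [generators[kind]() for _ in range(rows)]
        for kind, count in counts.items()
        for i in range(count)
    }
    return pd.DataFrame(data)


df = build_frame({"int": 2, "float": 1, "bool": 1}, rows=5)
```

Adding a new column type then means adding one entry to the mapping instead of another loop.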