Generating a PANDAS DataFrame of simulated coin tosses

Question

It's taking my machine quite a long time to execute 1 billion (1st loop x 10, 2nd loop x 1000, 3rd loop x 100,000) instructions. Suggestions for performance enhancements? Sources of potential concern:

global variables slower than local variables?
using for loops instead of something like map?
overhead for class instantiation?
overhead of storing results in a dictionary before making a pandas frame?

from numpy import random
from pandas import DataFrame, concat

class coin(object):

    HEADS = 1
    TAILS = 0
    FLIP_TIMES = 10

    def __init__(self):        
        self.frequency = self.flip10()
    
    def flip10(self):       
        choices = [coin.HEADS, coin.TAILS]
        record = []
        for i in xrange(coin.FLIP_TIMES):            
            record.append(random.choice(choices))
        return sum(record) / float(coin.FLIP_TIMES)
        
class trial(object):

    NUM_COINS = 1000
    DEFAULT_VAL = 11.0 / 10.0
    C_RAND_IDX = random.randint(0, NUM_COINS)

    def __init__(self):
        self.results = self.flip1000()

    def flip1000(self):
        v_rand = trial.DEFAULT_VAL
        v_first = trial.DEFAULT_VAL
        v_min = trial.DEFAULT_VAL
        for i in xrange(trial.NUM_COINS):
        
            # grab frequency
            v_i = coin().frequency
        
            # hold on to pocket minimum frequency
            if v_i < v_min:
                v_min = v_i
            
            # get random frequency
            if i == trial.C_RAND_IDX:
                v_rand = v_i
            
            # get first
            if i == 0:
                v_first = v_i           
        
        # some error checking to make sure they don't
        # still have their default values at this point
        # would be great    
        return {'rand': v_rand, 'first': v_first, 'min': v_min}   

data = DataFrame([trial().results for i in xrange(100000)])

Moving the local variable choices = [coin.HEADS, coin.TAILS] to global scope gives a slight performance improvement. However, since your loop counts are constants, I suspect operating on a single large array may be faster than an object-oriented solution. — subhacom, Commented Oct 5, 2015 at 2:31
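The single-array idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name simulate_trials and its parameters are hypothetical, it is written for Python 3 with a recent NumPy, and it picks a fresh random coin index per trial, which may or may not be what the original C_RAND_IDX intended.

```python
import numpy as np

def simulate_trials(num_trials, num_coins=1000, flip_times=10, seed=None):
    """Return the 'first', 'min' and 'rand' frequencies of every trial as arrays."""
    rng = np.random.default_rng(seed)
    # One binomial draw per coin: the number of heads in flip_times fair flips.
    # The array has num_trials * num_coins entries, so large runs need a lot of
    # memory and may have to be processed in chunks.
    heads = rng.binomial(flip_times, 0.5, size=(num_trials, num_coins))
    # Assumption: each trial gets its own random coin index (upper bound excluded).
    random_index = rng.integers(0, num_coins, size=num_trials)
    rows = np.arange(num_trials)
    return {
        'first': heads[:, 0] / float(flip_times),
        'min': heads.min(axis=1) / float(flip_times),
        'rand': heads[rows, random_index] / float(flip_times),
    }

result = simulate_trials(50, num_coins=20, flip_times=10, seed=0)
```

The returned dictionary of equal-length arrays is a shape that the pandas DataFrame constructor accepts directly.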

holroy · Accepted Answer · 2015-10-07 08:51:48Z

You are in for a treat: first a style review of the existing code, followed by a code review, then a code refactoring, and topped off with a performance review of the current solutions. All part of the experience here at Code Review, for free! Sit back, and enjoy. :-)

Style review

Actually it seems like you've done some coding before, and so most of the style is quite good. There are however some issues, and they are all related to naming:

Class names should be CamelCase – Neither coin nor trial is a good class name. Judging by what they do, better names could be CoinFlip and RepeatedCoinFlip
Don't use abbreviations in names – What do v_rand or C_RAND_IDX, and most of the other variable names, stand for? When you have to guess what a variable is, its name is a bad choice
Don't let a function name contain the parameter – Both flip10 and flip1000 are clearly named after something that should be a parameter. This is not dynamic and doesn't extend easily. What if you wanted flip10 to flip a hundred times?
Bonus point for using xrange in Python 2.7

Code review

There are, sadly enough, several code smells in your code, which affect both its readability and its efficiency. Let's address these:

Letting the __init__ do all the work – When you can't do anything with the class except what is done through the initialisation of the class, you should reconsider whether you are on the right path or not. In most cases this should be written as a simple function. If insisting on using class, you should strip down the initialisation, and let methods do the actual work
Initialisation of C_RAND_IDX should most likely be in __init__ – I'm guessing, but I assume this should change for every instantiation of the trial class. In that case it should not be a class-level constant; it should be set within the __init__ method
Hardcoding of parameters – Having methods like flip10 and flip1000 is going against what we have methods for. They should really have a parameter allowing for this number to change. You could have a default parameter, but it should be possible to change it
Calculating coin frequency using arrays and float calculation is a waste of memory and calculation power – In your current code there is no reason to store every coin toss in a list when all you need is the sum of all coin tosses (the number of tosses that came up 1). You could easily sum it directly. (Taking it all the way, you could even do a random.randint(0, coin.FLIP_TIMES) instead of the sum :) )
Simplify coin toss – Rebuilding choices = [coin.HEADS, coin.TAILS] and calling random.choice() seems like wasted energy. Why not simply use random.randint(0, 2), which gives the same result? (With numpy's random, as imported in your code, the upper bound is exclusive, so this returns 0 or 1.)
Why use classes, when methods suffice? – As already touched upon, your scenario doesn't seem to be a good case for classes as they only provide one single return in both cases. This could most likely be better handled in methods directly.
Avoid extensive float calculations, if possible – Arithmetic with floats is more expensive than int arithmetic, and int comparison is also cheaper than float comparison. In other words, why compute the frequency as a float when the integer count is just as good? You can always convert to a float frequency in the return statement. That would reduce the work from 100 000 * 1000 float divisions (one per coin) to 100 000 * 3 (one per returned value), which is roughly 333 times fewer float divisions (and the comparisons become int comparisons as well). That should be noticeable.
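As a quick check of the arithmetic in the last point, here is a small sketch that counts the float divisions, assuming one division per coin in the original code and one per returned value after the change. The variable names are only illustrative.

```python
# Counts taken from the question: 100 000 trials of 1000 coins each
trials = 100000
coins_per_trial = 1000
returned_values = 3  # 'first', 'min' and 'rand'

divisions_per_coin = trials * coins_per_trial   # one division for every coin
divisions_per_trial = trials * returned_values  # divide only the returned values

reduction_factor = divisions_per_coin / float(divisions_per_trial)
print('{:,} -> {:,} float divisions (about {:.0f} times fewer)'.format(
    divisions_per_coin, divisions_per_trial, reduction_factor))
```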

Code refactor

Let's give a pure functions implementation (before comparing some of these solutions).

import random

def coin_flip(flip_count):
    """Return how many tails (counted as 1) came up in flip_count coin flips."""
    # The standard library's randint includes both endpoints, so use (0, 1)
    return sum(random.randint(0, 1) for _ in xrange(flip_count))

def fake_coin_flip(flip_count):
    """Return a random tails count between 0 and flip_count."""
    return random.randint(0, flip_count)

def repeated_coin_flip(repetitions=1000,
                       flip_count=10,
                       coin_flip_function=coin_flip):
    """Return some frequencies from a large repetition of coin flip sequences.

    Repeat <repetitions> of <flip_count> coin flips. Out of these return three
    frequencies, namely the first, the minimum and a random frequency.
    """

    # No sequence can contain more than flip_count tails
    minimum_count = flip_count + 1
    # randrange excludes the upper bound, so the index is always reached
    random_index = random.randrange(repetitions)

    for i in xrange(repetitions):
        count_of_tails = coin_flip_function(flip_count)
        #print('coin_flip: {}'.format(count_of_tails))
        if count_of_tails < minimum_count:
            minimum_count = count_of_tails

        if i == random_index:
            random_count = count_of_tails

        if i == 0:
            first_count = count_of_tails

    flip_count_as_float = float(flip_count)
    return {'rand': random_count / flip_count_as_float,
            'first': first_count / flip_count_as_float,
            'min': minimum_count / flip_count_as_float }
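One caveat about fake_coin_flip is worth spelling out: a count drawn uniformly does not follow the same distribution as a count of real flips, so it is a speed trick, not an equivalent simulation. The sketch below compares the exact probabilities for ten flips. The helper names are hypothetical and it assumes a fair coin.

```python
from fractions import Fraction
from math import factorial

def real_flip_probability(tails, flip_count):
    """Probability that flip_count fair flips contain exactly `tails` tails."""
    combinations = factorial(flip_count) // (
        factorial(tails) * factorial(flip_count - tails))
    return Fraction(combinations, 2 ** flip_count)

def fake_flip_probability(tails, flip_count):
    """Probability of the same count when it is drawn uniformly from 0..flip_count."""
    return Fraction(1, flip_count + 1)

# Extreme counts are far more likely under the uniform shortcut
for tails in (0, 5, 10):
    print(tails, real_flip_probability(tails, 10), fake_flip_probability(tails, 10))
```

Zero tails in ten real flips has probability 1/1024, against 1/11 for the uniform draw, so the 'min' column in particular would look very different.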

Performance review

I've timed four solutions based on the answers so far: the original solution by compguy24, the solution by SuperBiasedMan, and two variations of my solution. In order to time them I've made a few adaptations, and I only focus on building the data list. Here is the interesting part of the performance setup:

import timeit

# Helper function to simulate the need of the full data set
# Could print out the first two, and last two elements
def print_some_data(data):
    """Print the two first and two last data elements"""
    for start_element in data[:2]:
        print('    first: {:,.4f},   min: {:,.4f},   rand: {:,.4f}'.format(
              start_element['first'], start_element['min'], start_element['rand']))

    print('    ... skipping loads of items ...')

    for end_element in data[-2:]:
        print('    first: {:,.4f},   min: {:,.4f},   rand: {:,.4f}'.format(
              end_element['first'], end_element['min'], end_element['rand']))

# After renaming classes to Coin & Trial
def compguy24_solution(datasize=1000, print_data = True):
    print('    Generating {} elements'.format(datasize))
    data = [Trial().results for i in xrange(datasize)]
    if print_data:
        print_some_data(data)

def SuperBiasedMan_solution(datasize=1000, print_data=True):
    print('    Generating {} elements'.format(datasize))
    data = [trial() for _ in xrange(datasize)]
    if print_data:
        print_some_data(data)

def holroy_solution(datasize=1000, print_data=True):
    print('    Generating {} elements'.format(datasize))
    data = [repeated_coin_flip(1000, 10) for _ in xrange(datasize)]
    if print_data:
        print_some_data(data)

# Same as previous, but faking the coin_flip :-D
def holroy_solution_v2(datasize=1000, print_data=True):
    print('    Generating {} elements'.format(datasize))
    data = [repeated_coin_flip(1000, 10, fake_coin_flip) for _ in xrange(datasize)]
    if print_data:
        print_some_data(data)


def main():
    test_case = "from {0} import {1}; {1}({2}, False)"

    for test_function in ('compguy24_solution',
                          'SuperBiasedMan_solution',
                          'holroy_solution',
                          'holroy_solution_v2',):

        print ('\nTesting {}'.format(test_function))

        datasize = 1000
        print('    execution time: {:,.4f} seconds'.format(
              timeit.timeit(test_case.format(__name__, test_function, datasize),
                                             number=1)))


if __name__ == '__main__':
    main()
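As an alternative to building the statement as a string, timeit.timeit also accepts a callable. A minimal sketch of that variant follows, written for Python 3. The names time_once and build_squares are hypothetical stand-ins, not part of the setup above.

```python
import timeit

def time_once(function, *args, **kwargs):
    """Return the seconds taken by a single call of function(*args, **kwargs)."""
    return timeit.timeit(lambda: function(*args, **kwargs), number=1)

def build_squares(datasize=1000):
    """Hypothetical stand-in for one of the solution functions."""
    return [i * i for i in range(datasize)]

elapsed = time_once(build_squares, 100000)
print('    execution time: {:,.4f} seconds'.format(elapsed))
```

This avoids the import statement inside the timed string, at the cost of one extra function call through the lambda.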

Updated: I did a few test runs to compare the different solutions before doing a final test run of 100 000 elements (which I let run overnight). With 1000 elements I found the following: the original code (58.6 seconds) and the version by SuperBiasedMan (56.5 seconds) run in around a minute, whilst my version using int comparison and a simpler random function runs in only 2.8 seconds (about 20 times faster). (And when faking the coin flip, it takes only 0.5 seconds to complete. :-) )

But here is the ultimate test, a run generating 100 000 elements on my computer:

Testing compguy24_solution
    Generating 100000 elements
    execution time: 5,600.1455 seconds

Testing SuperBiasedMan_solution
    Generating 100000 elements
    execution time: 5,565.2747 seconds

Testing holroy_solution
    Generating 100000 elements
    execution time: 277.3780 seconds

Testing holroy_solution_v2
    Generating 100000 elements
    execution time: 43.8776 seconds

That is, the original and the solution by SuperBiasedMan each took around 1.5 hours to complete, whilst my solution took about 4.5 minutes (and the fake coin_flip variant under a minute). I think it is somewhat clear which solution I would prefer! Sorry guys! :-D
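For reference, the speed-ups implied by the timings above can be worked out with a few lines. The numbers are copied from the printed results. The dictionary and variable names are only illustrative.

```python
# Seconds, copied from the 100 000 element run above
timings = {
    'compguy24_solution': 5600.1455,
    'SuperBiasedMan_solution': 5565.2747,
    'holroy_solution': 277.3780,
    'holroy_solution_v2': 43.8776,
}

baseline = timings['compguy24_solution']
for name, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print('{:<25} {:>8.1f} min   {:>6.1f}x faster than the original'.format(
        name, seconds / 60.0, baseline / seconds))

speedup = baseline / timings['holroy_solution']
```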

I think you could have posted this as multiple answers and gotten credit for each one. — 200_success, Commented Oct 6, 2015 at 21:26
@200_success, I think I'm gonna keep this as one answer, but which (and how many) parts would you have divided it into? — holroy, Commented Oct 7, 2015 at 8:53

SuperBiasedMan · Accepted Answer · 2015-10-06 09:25:55Z

It's good that you're using sum because that saves significant time, but you could save even more by passing a generator expression directly instead of a list. A generator expression is basically a for loop collapsed into an expression, and sum can take one instead of a list. This is how you'd write it:

    return (sum(random.choice(choices) for _ in xrange(coin.FLIP_TIMES))
            / float(coin.FLIP_TIMES))

It's a little less neat but saves you time.
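To illustrate that the two forms compute the same thing, here is a small side-by-side sketch, written for Python 3 (range instead of xrange). The function names are hypothetical, and it uses seeded random.Random instances from the standard library so both versions see the same flips.

```python
import random

CHOICES = (1, 0)  # heads and tails

def frequency_from_list(flip_times, rng):
    """Build the full list of flips first, then sum it."""
    record = []
    for _ in range(flip_times):
        record.append(rng.choice(CHOICES))
    return sum(record) / float(flip_times)

def frequency_from_generator(flip_times, rng):
    """Feed the flips to sum() one at a time, with no intermediate list."""
    return sum(rng.choice(CHOICES) for _ in range(flip_times)) / float(flip_times)

# Identical seeds give identical flip sequences, so the results match
list_result = frequency_from_list(10, random.Random(42))
generator_result = frequency_from_generator(10, random.Random(42))
```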

You also recreate the choices list each time you call flip10. In this program it doesn't seem to make a difference but it's still bad practice. choices should be a constant that just exists as an attribute.

I don't like the idea of naming functions flip10 and flip1000, especially since it sounds like they'd conflict. Instead, use the name flip and pass a number to it. I can see no clear reason to make the number a fixed constant; you could make it a default value. I'd make choices a constant rather than rebuild it for each flip call, and a tuple fits this case too, as tuples are immutable.

class coin(object):

    # Heads and tails
    CHOICES = (1, 0)

    def __init__(self):
        self.frequency = self.flip()

    def flip(self, flip_times=10):
        # CHOICES is a class attribute, so it must be reached through the class
        return (sum(random.choice(coin.CHOICES) for _ in xrange(flip_times))
                / float(flip_times))

I don't think trial or coin should be classes at all. They're used as procedures, so it makes more sense to make them functions. In both cases you're just instantiating them to run one function and get a single attribute as a result. This is exactly what functions are for. Classes are also less performant here! I changed them to functions instead of classes and shaved off 10% of the execution time without making any other changes.

C_RAND_IDX confused me at first. It's not clear, nor is it explained with a comment. Why is it a constant if it's randomly picked? It seems like it's supposed to be a persistent value in the class to check on each trial object created. This is awkward to do without a clear explanation. But if I have understood correctly, you can still do this with a function. Functions in Python are still just objects and can take attributes just like a class, so you could just assign this after the function definition:

def trial(num_coins):
    ...
        if i == trial.C_RAND_IDX:
            v_rand = v_i
    ...
    return result
trial.NUM_COINS = 1000
trial.C_RAND_IDX = random.randint(0, trial.NUM_COINS)

As you can see you still call it with trial.C_RAND_IDX just as you had before. For this structure you need to keep NUM_COINS as a constant attribute. Much as I don't like this, it will do what your old code did and save time.
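A small self-contained sketch of the function-attribute technique, for anyone who hasn't seen it before. The function name pick_coin is hypothetical, and it uses the standard library random module, not numpy.

```python
import random

def pick_coin(rng):
    """Return the index stored on the function, plus a fresh coin flip."""
    return pick_coin.RANDOM_INDEX, rng.choice(pick_coin.CHOICES)

# Attributes are assigned once, after the definition, and persist across calls
pick_coin.CHOICES = (1, 0)
pick_coin.NUM_COINS = 1000
pick_coin.RANDOM_INDEX = random.randrange(pick_coin.NUM_COINS)

first_call = pick_coin(random.Random(1))
second_call = pick_coin(random.Random(2))
```

Both calls report the same index, because RANDOM_INDEX is evaluated a single time when the attribute is assigned, not on every call.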

You can also assign multiple equal variables at once, simplifying the v_ assignments:

    v_rand = v_first = v_min = DEFAULT_VAL

That said those names are strange and unclear. Why can't they drop the v_ prefix? If it's necessary, please explain why. (note that I don't use Pandas so if it's a style from there I'm unfamiliar with it).

Here's how I'd rewrite the code overall:

def coin(flip_times=10):
    return (sum(random.choice(coin.CHOICES) for _ in xrange(flip_times))
            / float(flip_times))
coin.CHOICES = (1, 0)

def trial(num_coins=1000):
    v_rand = v_first = v_min = trial.DEFAULT_VAL

    for i in xrange(num_coins):
        # grab frequency
        v_i = coin()
        # hold on to pocket minimum frequency
        if v_i < v_min:
            v_min = v_i

        # get random frequency
        if i == trial.C_RAND_IDX:
            v_rand = v_i

        # get first
        if i == 0:
            v_first = v_i           

    return {'rand': v_rand, 'first': v_first, 'min': v_min}   
trial.NUM_COINS = 1000
trial.DEFAULT_VAL = 11.0 / 10.0
trial.C_RAND_IDX = random.randint(0, trial.NUM_COINS)
