
When performing a groupby on dates (as object), I realized it was way less efficient than on int. Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12],
                   'date': ['2015-01-12', '2014-09-27', '2014-10-14', '2010-11-26',
                            '2010-04-09', '2012-12-21', '2009-08-16',
                            '2013-07-09', '2014-02-14', '2012-12-04']})
df

Out[1]:
    date        id1 id2 value
0   2015-01-12  1   10  123
1   2014-09-27  1   20  156
2   2014-10-14  1   30  178
3   2010-11-26  1   10  19
4   2010-04-09  2   20  354
5   2012-12-21  2   30  26
6   2009-08-16  2   10  84
7   2013-07-09  3   20  56
8   2014-02-14  3   30  984
9   2012-12-04  3   10  12

Here are the types of the column:

df.dtypes

Out[2]:
date     object
id1       int64
id2       int64
value     int64
dtype: object

And now let's take a look at the efficiency of aggregations:

%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
1000 loops, best of 3: 1.35 ms per loop

%timeit df.groupby(['id1','id2']).agg({'date':np.max})
100 loops, best of 3: 2.75 ms per loop

As you can see, it takes twice as long for date as it does for value, which is inconvenient on big DataFrames.

Is there a way to perform the agg more efficiently on dates? Maybe by changing the type of the date column, or by using another function to get the max?
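For context on where the slowdown comes from: with object dtype, the column holds plain Python strings, so the per-group max is computed at Python speed rather than through a vectorized path. A minimal check (a two-row sketch, not the full frame) of what groupby actually sees:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-12', '2014-09-27']})

# object dtype: each element is an ordinary Python str, so reductions
# like max() compare string objects one by one
print(df['date'].dtype)          # object
print(type(df['date'].iloc[0]))  # <class 'str'>
```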

normally dates should be datetime but I note that 2014-15-27 is not a valid date unless you're using some funky calendar – EdChum 18 hours ago
    
Indeed, I typed it thinking about my 13th, 14th and maybe 15th month apparently! Changed it. (Note that the type of date column is still object, I suspect this is the problem since python can't use dedicated functions for objects) – ysearka 18 hours ago
Accepted answer:

Changing the dtype to datetime gives comparable perf for me:

In [86]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df

Out[86]:
        date  id1  id2  value
0 2015-01-12    1   10    123
1 2014-09-27    1   20    156
2 2014-10-14    1   30    178
3 2010-11-26    1   10     19
4 2010-04-09    2   20    354
5 2012-12-21    2   30     26
6 2009-08-16    2   10     84
7 2013-07-09    3   20     56
8 2014-02-14    3   30    984
9 2012-12-04    3   10     12

In [88]:
# with datetime dtype    
%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
%timeit df.groupby(['id1','id2']).agg({'date':np.max})
1000 loops, best of 3: 1.56 ms per loop
1000 loops, best of 3: 1.33 ms per loop

In [91]:
# with `str` dtype    
%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
%timeit df.groupby(['id1','id2']).agg({'date':np.max})
1000 loops, best of 3: 1.56 ms per loop
100 loops, best of 3: 3.07 ms per loop

So with a datetime dtype, aggregating the date column performs comparably to aggregating the 'value' column.
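The whole fix can be run as a plain script (same data as the question; using the string aggregation name `'max'`, which pandas accepts in `agg` alongside the numpy functions):

```python
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12],
                   'date': ['2015-01-12', '2014-09-27', '2014-10-14', '2010-11-26',
                            '2010-04-09', '2012-12-21', '2009-08-16',
                            '2013-07-09', '2014-02-14', '2012-12-04']})

# Parse the object column into datetime64[ns]; groupby can then take
# a fast path instead of comparing Python strings
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

result = df.groupby(['id1', 'id2']).agg({'date': 'max'})
print(result)
```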

Ok, so the idea indeed was to change the type of the column. I admit being a bit confused about the different formats of date in python, I will have to have a look at them. Thank you! – ysearka 18 hours ago
    
There is a handy website that explains the strftime format fields: strftime.org – EdChum 18 hours ago
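For a quick illustration of those fields, `%Y`/`%m`/`%d` cover the ISO-style dates used here:

```python
import pandas as pd

# %Y = 4-digit year, %m = zero-padded month, %d = zero-padded day
ts = pd.to_datetime('2014-09-27', format='%Y-%m-%d')
print(ts)  # 2014-09-27 00:00:00
```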
    
Why is agg() actually needed? When I use df.groupby(['id1','id2']).sum(), I get the identical result. – Cleb 18 hours ago
    
@Cleb I'm assuming that this is just an example bit of code, the OP is likely to be using agg with a real list or dict of funcs in their real code or using agg out of habit – EdChum 18 hours ago
    
Ok, makes sense; was just wondering whether I miss something here. – Cleb 18 hours ago
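To confirm the point raised in the comments: for a single aggregation function, `agg()` and calling the reduction directly produce the same result (a sketch on the question's numeric columns):

```python
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12]})

a = df.groupby(['id1', 'id2']).agg({'value': 'sum'})
b = df.groupby(['id1', 'id2']).sum()  # sums every remaining column, here just 'value'
print(a.equals(b))  # True
```

`agg` earns its keep once you want different functions per column (as in the question's `{'date': np.max}` / `{'value': np.sum}`); for one function, either spelling works.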
