
When performing a groupby on dates (as object), I realized it was way less efficient than on int. Here is an example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12],
                   'date': ['2015-01-12', '2014-09-27', '2014-10-14', '2010-11-26',
                            '2010-04-09', '2012-12-21', '2009-08-16',
                            '2013-07-09', '2014-02-14', '2012-12-04']})
df

Out[1]:
    date        id1 id2 value
0   2015-01-12  1   10  123
1   2014-09-27  1   20  156
2   2014-10-14  1   30  178
3   2010-11-26  1   10  19
4   2010-04-09  2   20  354
5   2012-12-21  2   30  26
6   2009-08-16  2   10  84
7   2013-07-09  3   20  56
8   2014-02-14  3   30  984
9   2012-12-04  3   10  12

Here are the types of the column:

df.dtypes

Out[2]:
date     object
id1       int64
id2       int64
value     int64
dtype: object

And now let's take a look at the efficiency of aggregations:

%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
1000 loops, best of 3: 1.35 ms per loop

%timeit df.groupby(['id1','id2']).agg({'date':np.max})
100 loops, best of 3: 2.75 ms per loop

As you can see, it takes twice as long for date as it does for value, which is inconvenient on big DataFrames.

Is there a way to perform the agg more efficiently on dates? Maybe by changing the type of the date column, or by using another function to get the max?
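For context on where the slowdown comes from: with object dtype, the column holds plain Python strings, so the per-group max is computed at Python speed rather than through a vectorized path. A minimal check (a two-row sketch, not the full frame) of what groupby actually sees:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-12', '2014-09-27']})

# object dtype: each element is an ordinary Python str, so reductions
# like max() compare string objects one by one
print(df['date'].dtype)          # object
print(type(df['date'].iloc[0]))  # <class 'str'>
```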

normally dates should be datetime but I note that 2014-15-27 is not a valid date unless you're using some funky calendar – EdChum 18 hours ago
    
Indeed, I typed it thinking about my 13th, 14th and maybe 15th month apparently! Changed it. (Note that the type of date column is still object, I suspect this is the problem since python can't use dedicated functions for objects) – ysearka 18 hours ago
Accepted answer:

Changing the dtype to datetime gives comparable perf for me:

In [86]:
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df

Out[86]:
        date  id1  id2  value
0 2015-01-12    1   10    123
1 2014-09-27    1   20    156
2 2014-10-14    1   30    178
3 2010-11-26    1   10     19
4 2010-04-09    2   20    354
5 2012-12-21    2   30     26
6 2009-08-16    2   10     84
7 2013-07-09    3   20     56
8 2014-02-14    3   30    984
9 2012-12-04    3   10     12

In [88]:
# with datetime dtype    
%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
%timeit df.groupby(['id1','id2']).agg({'date':np.max})
1000 loops, best of 3: 1.56 ms per loop
1000 loops, best of 3: 1.33 ms per loop

In [91]:
# with `str` dtype    
%timeit df.groupby(['id1','id2']).agg({'value':np.sum})
%timeit df.groupby(['id1','id2']).agg({'date':np.max})
1000 loops, best of 3: 1.56 ms per loop
100 loops, best of 3: 3.07 ms per loop

So with a datetime dtype, aggregating the date column performs comparably to aggregating the 'value' column.
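The whole fix can be run as a plain script (same data as the question; using the string aggregation name `'max'`, which pandas accepts in `agg` alongside the numpy functions):

```python
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12],
                   'date': ['2015-01-12', '2014-09-27', '2014-10-14', '2010-11-26',
                            '2010-04-09', '2012-12-21', '2009-08-16',
                            '2013-07-09', '2014-02-14', '2012-12-04']})

# Parse the object column into datetime64[ns]; groupby can then take
# a fast path instead of comparing Python strings
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

result = df.groupby(['id1', 'id2']).agg({'date': 'max'})
print(result)
```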

Ok, so the idea indeed was to change the type of the column. I admit being a bit confused about the different formats of date in python, I will have to have a look at them. Thank you! – ysearka 18 hours ago
    
There is a handy website that explains the strftime format fields: strftime.org – EdChum 18 hours ago
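For a quick illustration of those fields, `%Y`/`%m`/`%d` cover the ISO-style dates used here:

```python
import pandas as pd

# %Y = 4-digit year, %m = zero-padded month, %d = zero-padded day
ts = pd.to_datetime('2014-09-27', format='%Y-%m-%d')
print(ts)  # 2014-09-27 00:00:00
```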
    
Why is agg() actually needed? When I use df.groupby(['id1','id2']).sum(), I get the identical result. – Cleb 18 hours ago
    
@Cleb I'm assuming that this is just an example bit of code, the OP is likely to be using agg with a real list or dict of funcs in their real code or using agg out of habit – EdChum 18 hours ago
    
Ok, makes sense; was just wondering whether I miss something here. – Cleb 18 hours ago
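To confirm the point raised in the comments: for a single aggregation function, `agg()` and calling the reduction directly produce the same result (a sketch on the question's numeric columns):

```python
import pandas as pd

df = pd.DataFrame({'id1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'id2': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10],
                   'value': [123, 156, 178, 19, 354, 26, 84, 56, 984, 12]})

a = df.groupby(['id1', 'id2']).agg({'value': 'sum'})
b = df.groupby(['id1', 'id2']).sum()  # sums every remaining column, here just 'value'
print(a.equals(b))  # True
```

`agg` earns its keep once you want different functions per column (as in the question's `{'date': np.max}` / `{'value': np.sum}`); for one function, either spelling works.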
