
I am trying to do a somewhat complicated group and sort operation in pandas. I want to sort the groups by their values in ascending order, using successive values for tiebreaks as needed.

I have read the similar question Pandas dataframe: How to sort groups by the earliest time of a group, but that only uses the minimum value in each group for sorting and thus doesn't handle the case where two groups have the same minimum value but differ in their other values.

Likewise, pandas groupby, then sort within groups discusses within-group sorting, but not between-group sorting, which is what I'm after.

As a concrete example, consider the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({"pool": [5, 1, 9, 9, 5, 7, 7, 7, 9, 1, 5],
                   "arrival": [227, 60, 60, 88, 55, 55, 276, 46, 46, 35, 35]})

I want to sort the pools by arrival, such that the resulting dataframe is:

    pool  arrival
10     5       35
4      5       55
0      5      227
9      1       35
1      1       60
7      7       46
5      7       55
6      7      276
8      9       46
2      9       60
3      9       88

I have been able to accomplish this via the following code:

# create column to indicate order of values in each group
df = df.sort_values("arrival")
df["order"] = df.groupby("pool")["arrival"].cumcount()

# use 'order' column to make columns for each arrival position
df["first"] = df["second"] = df["third"] = np.nan
df.loc[df["order"] == 0,"first"] = df.loc[df["order"] == 0,"arrival"]
df.loc[df["order"] == 1,"second"] = df.loc[df["order"] == 1,"arrival"]
df.loc[df["order"] == 2,"third"] = df.loc[df["order"] == 2,"arrival"]

# propagate the values to every member of the group
df[["first","second","third"]] = df.groupby("pool")[["first","second","third"]].transform("max")

# for groups with less than three members, fill the values with previous ones
df["second"] = df["second"].fillna(df["first"])
df["third"] = df["third"].fillna(df["second"])

# sort by the arrival position columns, then drop all the helper columns
df = df.sort_values(["first","second","third","pool"]).drop(columns=["first","second","third","order"])

It works, but it doesn't scale to pools with larger numbers of arrivals (20+), since every arrival position needs its own helper column. I'm convinced there has to be a better way, but I can't figure out how to do it.

I also tried combining the transform and nth functions as discussed in Using transform together with nth but, contrary to the accepted answer on that question, passing "nth" to groupby.transform raises ValueError: 'nth' is not a valid function name for transform(name), since nth may return zero or multiple values for a given group and transform can't handle those cases.
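For reference, the failing attempt looked roughly like this (my reconstruction of the call described above):

# raises ValueError: 'nth' is not a valid function name for transform(name)
df["first"] = df.groupby("pool")["arrival"].transform("nth", 0)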

3 Answers


You could sort and aggregate the values into tuples.

(df.sort_values("arrival")
   .groupby("pool", as_index=False)
   .agg(tuple)
)
#    pool        arrival
# 0     1       (35, 60)
# 1     5  (35, 55, 227)
# 2     7  (46, 55, 276)
# 3     9   (46, 60, 88)

Then sort by the tuples and explode() back into rows.

(df.sort_values("arrival")
   .groupby("pool", as_index=False)
   .agg(tuple)
   .sort_values("arrival")
   .explode("arrival")
)
#    pool arrival
# 1     5      35
# 1     5      55
# 1     5     227
# 0     1      35
# 0     1      60
# 2     7      46
# 2     7      55
# 2     7     276
# 3     9      46
# 3     9      60
# 3     9      88
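Note that the exploded frame keeps the aggregated group positions as its index rather than the original row labels shown in the question. If those labels matter, one variant (a sketch of the same idea, not part of the original answer; multi-column explode requires pandas 1.3+) carries the original index through the aggregation:

(df.reset_index()                   # keep original row labels as a column
   .sort_values("arrival")
   .groupby("pool", as_index=False)
   .agg(tuple)                      # aggregates both "index" and "arrival"
   .sort_values("arrival")          # tuples compare lexicographically
   .explode(["index", "arrival"])   # back to one row per arrival
   .set_index("index")
   .rename_axis(None)
)

This reproduces the expected output from the question, original index included.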
  • Not only is this the one-liner(ish) that I was hoping for, I also learned I can use tuple as an argument to agg() - I had no idea!
    – Jessica
    Commented yesterday
  • That's an elegant approach, but groupby and Python objects are slow in pandas. You can solve this more efficiently with simple sorting ;)
    – mozway
    Commented 23 hours ago
  • @mozway I suspect you mean that the combination of groupby and Python objects is slow
    – sammywemmy
    Commented 22 hours ago
  • @sammywemmy both are individually slower than vectorized functions. Forming the groups and applying the aggregation has a cost, then sorting the tuple objects has another cost. Take s = pd.Series([(1,), (2,), (3,)]*1000); s2 = s.str[0] and compare the timings of s.sort_values() and s2.sort_values(). Same number of items, same number of needed comparisons, but sorting the tuples is 10x slower. (A runnable version of this comparison appears after these comments.)
    – mozway
    Commented 22 hours ago
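A runnable version of the comparison sketched in the comment above (my reconstruction; absolute timings will vary by machine and pandas version):

import timeit

import pandas as pd

# object-dtype Series of 1-tuples vs. the unpacked integers
s = pd.Series([(1,), (2,), (3,)] * 1000)
s2 = s.str[0]

# sorting the tuples is roughly an order of magnitude slower
print(timeit.timeit(s.sort_values, number=100))
print(timeit.timeit(s2.sort_values, number=100))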

You can obtain the order of the groups efficiently by sorting the values by arrival, reshaping with pivot and sorting all columns, which is similar to your original approach:

tmp = df.sort_values('arrival')

# position of each arrival within its pool (0 = smallest, 1 = next, ...)
g = tmp.groupby('pool').cumcount()

# one column per arrival position, then a lexicographic sort across the columns;
# na_position='first' ranks groups with fewer arrivals ahead on ties
order = (tmp.assign(col=g)
            .pivot(index='pool', columns='col', values='arrival')
            .sort_values(by=list(g.unique()), na_position='first')
            .index
        )
# Index([5, 1, 7, 9], dtype='int64', name='pool')

Then you can simply lexsort based on the custom order and other columns:

import numpy as np

# np.lexsort treats the LAST key as primary: sort by group rank, then by arrival
out = df.iloc[np.lexsort([df['arrival'],
                          df['pool'].map({k: v for v, k in enumerate(order)})])]

Or sort with a Categorical:

# 'pool' becomes an ordered categorical, so sort_values follows the custom order
out = (df.astype({'pool': pd.CategoricalDtype(order, ordered=True)})
         .sort_values(by=['pool', 'arrival'])
      )

Output:

    pool  arrival
10     5       35
4      5       55
0      5      227
9      1       35
1      1       60
7      7       46
5      7       55
6      7      276
8      9       46
2      9       60
3      9       88

Timings

(Benchmark plot comparing the timings of the different approaches; image not reproduced here.)

  • I don't think sort + unique gives the order the OP wants. Consider the following sample data frame: df = pd.DataFrame({'pool': [1, 1, 2, 2], 'arrival': [20, 40, 20, 30]}). The group with pool 2 should come above the group with pool 1, and we can't get that with sort + unique.
    – Panda Kim
    Commented 19 hours ago
  • @PandaKim that's a fair point; in that case you can pivot to ensure a correct order while keeping an efficient vectorized sort. See the updated answer
    – mozway
    Commented 19 hours ago
  • If you use pivot + sort, you will need to adjust the position of nulls. I think you need na_position='first'.
    – Panda Kim
    Commented 11 hours ago
  • Upvote for the greater efficiency of this approach. In my particular case, I'll never have more than ~100 rows - and often far fewer - so I prefer @jqurious's answer for its improved readability.
    – Jessica
    Commented 10 hours ago

Here is a much simpler way to achieve this:

df = (df.groupby('pool')
        .apply(lambda x: x.sort_values(by='arrival'), include_groups=False)
        .reset_index()
        .drop("level_1", axis=1))

Explanation:

First, we group by 'pool' since we only want to sort the 'arrival' values for every pool.

We then use .apply to receive each group and sort its arrival values. We also set include_groups to False, since operating on the grouping column inside apply is deprecated (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html).

We now get a data frame which looks like this:

arrival
pool            
1    9        35
     1        60
5    10       35
     4        55
     0       227
7    7        46
     5        55
     6       276
9    8        46
     2        60
     3        88

Close, but we don't need the index, so we reset the index (reset_index()):

    pool  level_1  arrival
0      1        9       35
1      1        1       60
2      5       10       35
3      5        4       55
4      5        0      227
5      7        7       46
6      7        5       55
7      7        6      276
8      9        8       46
9      9        2       60
10     9        3       88

We are so close now! We just need to drop the level_1 column (which holds the original row index) that was created by reset_index().

.drop("level_1", axis=1)

The final output now looks like this:

    pool  arrival
0      1       35
1      1       60
2      5       35
3      5       55
4      5      227
5      7       46
6      7       55
7      7      276
8      9       46
9      9       60
10     9       88

If you want the index to be the original index for the rows, then instead of dropping the level_1 column, just set the index to be the level_1 column:

.set_index("level_1", drop=True).rename_axis(None, axis=0)

(This code goes after the .reset_index(), in place of the .drop call.)

First we set the index to the level_1 column using set_index (drop=True removes it as a column), and then we clear the index name using .rename_axis(None, axis=0).

Result:

    pool  arrival
9      1       35
1      1       60
10     5       35
4      5       55
0      5      227
7      7       46
5      7       55
6      7      276
8      9       46
2      9       60
3      9       88

If you have any questions, feel free to ask me!

  • reset_index() has an option drop=True which omits the original index.
    – user19077881
    Commented yesterday
  • @user19077881 The pool column is also in the index, and the pool column is needed
    – Aadvik
    Commented yesterday
  • This doesn't quite accomplish the goal - note that pool 5 (with values [35, 55, 227]) should come before pool 1 (with values [35, 60]) because 55 < 60.
    – Jessica
    Commented yesterday
  • @Jessica Oh, so you want to sort the pool based on the maximum value
    – Aadvik
    Commented yesterday
  • @Aadvik Not quite, I want it sorted by minimum value, then second lowest value, then third lowest value, etc.
    – Jessica
    Commented yesterday
