I am trying to do a somewhat complicated group and sort operation in pandas. I want to sort the groups by their values in ascending order, using successive values for tiebreaks as needed.
I have read the similar question Pandas dataframe: How to sort groups by the earliest time of a group, but that only uses the minimum value in each group for sorting and thus doesn't handle the case where two groups have the same minimum value but differ in their other values.
Likewise, pandas groupby, then sort within groups discusses within group sorting, but not between group sorting, which is what I'm after.
As a concrete example, consider the following dataframe:
df = pd.DataFrame({"pool": [5, 1, 9, 9, 5, 7, 7, 7, 9, 1, 5],
"arrival":[227, 60, 60, 88, 55, 55, 276, 46, 46, 35, 35]})
I want to sort the pools by arrival, such that the resulting dataframe is:
pool arrival
10 5 35
4 5 55
0 5 227
9 1 35
1 1 60
7 7 46
5 7 55
6 7 276
8 9 46
2 9 60
3 9 88
I have been able to accomplish this via the following code:
# create column to indicate order of values in each group
df = df.sort_values("arrival")
df["order"] = df.groupby("pool")["arrival"].cumcount()
# use 'order' column to make columns for each arrival position
df["first"] = df["second"] = df["third"] = np.nan
df.loc[df["order"] == 0,"first"] = df.loc[df["order"] == 0,"arrival"]
df.loc[df["order"] == 1,"second"] = df.loc[df["order"] == 1,"arrival"]
df.loc[df["order"] == 2,"third"] = df.loc[df["order"] == 2,"arrival"]
# propagate the values to every member of the group
df[["first","second","third"]] = df.groupby("pool")[["first","second","third"]].transform("max")
# for groups with less than three members, fill the values with previous ones
df["second"] = df["second"].fillna(df["first"])
df["third"] = df["third"].fillna(df["second"])
# sort by the arrival position columns, then drop all the helper columns
df = df.sort_values(["first","second","third","pool"]).drop(columns=["first","second","third","order"])
It works, but it's not particularly scalable to pools with larger numbers of arrivals (20+). I'm convinced there has to be a better way, but I can't figure out how to do it.
I also tried combining the transform
and nth
functions as discussed in Using transform together with nth but, contrary to the accepted answer on that question, trying to pass "nth"
to groupby.transform
raises ValueError: 'nth' is not a valid function name for transform(name)
since nth
may return none or multiple values for a given group and transform can't handle those cases.