Get the row corresponding to the maximum value in a pandas GroupBy

Question

8

A simple DataFrame:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})
df
   A  B  C
0  1  0  a
1  1  1  b
2  2  2  c
3  2  3  d

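For each value of column A (groupby), I want the value of column C in the row where column B is largest. For example, in group 1 of column A the maximum of column B is 1, so I want the value "b" from column C: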

   A  C
0  1  b
1  2  d

Do not assume that column B is sorted. Performance is the top priority, then elegance.

- Giora Simchoni

4 Answers

7

df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
#    A
#1    b
#2    d

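Use idxmax to find the index label of the row where B is largest, then select the value of C at that label within each group (via the lambda function).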

- Jondiedoop
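
For readers who want to see the two steps separately, here is a minimal sketch that spells out the lambda as a named function. It assumes the example df from the question; the names c_at_max_b, row_label and result are illustrative only, and selecting [['B', 'C']] before apply is a choice made here so the function only sees the columns it uses.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

def c_at_max_b(group):
    # Step 1: index label of the row where B is largest within this group
    # (idxmax returns the first such label if the maximum is tied).
    row_label = group['B'].idxmax()
    # Step 2: value of C at that label.
    return group.loc[row_label, 'C']

# apply returns a Series indexed by A; reset_index turns it into the
# two-column A/C frame asked for in the question.
result = df.groupby('A')[['B', 'C']].apply(c_at_max_b).reset_index(name='C')
print(result)
```

Because the function is called once per group in Python, this form is likely to be slower than the vectorized alternatives in the other answers when there are many groups.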

5

Here is something fun with groupby and nlargest:

(df.set_index('C')
   .groupby('A')['B']
   .nlargest(1)
   .index
   .to_frame()
   .reset_index(drop=True))

   A  C
0  1  b
1  2  d

Or use sort_values, groupby and last:

df.sort_values('B').groupby('A')['C'].last().reset_index()

   A  C
0  1  b
1  2  d

- cs95

2

Similar to @Jondiedoop's solution, but avoids apply:

u = df.groupby('A')['B'].idxmax()

df.loc[u, ['A', 'C']].reset_index(drop=True)

   A  C
0  1  b
1  2  d

- user3483203

- BENY · Accepted Answer

9

Check with sort_values + drop_duplicates.

df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]: 
   A  B  C
1  1  1  b
3  2  3  d

- BENY
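
The output above keeps all three columns and the original index, while the question asks for only A and C. The following is a minimal sketch of how that result could be trimmed, assuming the example df from the question. The column selection, reset_index and kind='stable' are additions made here and are not part of the answer; the stable sort is only there so that rows tied on B keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

result = (
    df.sort_values('B', kind='stable')    # ascending, so each group's max B comes last
      .drop_duplicates('A', keep='last')  # keep only that last row per value of A
      .loc[:, ['A', 'C']]                 # drop B from the result
      .reset_index(drop=True)             # replace the original labels with 0..n-1
)
print(result)
```

Note that the rows come out ordered by B rather than by A; the two orders happen to coincide for this example.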

1

That is really impressive, I must say. - Giora Simchoni

1

Accepting this answer because, according to timeit, it is 0.0002 seconds faster than @coldspeed's answer [np.mean(timeit.repeat("df.sort_values('B').drop_duplicates(['A'],keep='last')", number = 1, repeat = 100, globals = globals()))]. - Giora Simchoni
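
The timing call in the comment above can be wrapped in a small harness that measures several candidates the same way. This is a minimal sketch assuming the example df from the question. The comment does not say which of the other answer's two solutions was timed, so including the sort_values/groupby/last variant is an assumption, and the names candidates and mean_seconds are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

# Statements to time, written as strings exactly as timeit expects them.
candidates = {
    'sort_values + drop_duplicates':
        "df.sort_values('B').drop_duplicates(['A'], keep='last')",
    'sort_values + groupby + last':
        "df.sort_values('B').groupby('A')['C'].last().reset_index()",
}

# Same settings as in the comment: 100 repeats of a single execution each,
# averaged with np.mean. Only df is exposed to the timed statements.
mean_seconds = {
    label: float(np.mean(timeit.repeat(stmt, number=1, repeat=100, globals={'df': df})))
    for label, stmt in candidates.items()
}

for label, seconds in mean_seconds.items():
    print(f"{label}: {seconds:.6f} s")
```

On a four-row frame the measurements are dominated by fixed overhead, so a comparison on data of realistic size would be more informative.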

1

@GioraSimchoni Thank you for the fair consideration and the timings! - cs95

This is great! - Aman Singh