Python Pandas: drop duplicates, keeping the second-to-last row

Question

15

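What is the most efficient way to select the second-to-last element of each set of duplicates in a pandas DataFrame?

For example, I basically want to do this operation: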

df = df.drop_duplicates(['Person','Question'],take_last=True)

but with something like this instead:

df = df.drop_duplicates(['Person','Question'],take_second_last=True)

Abstracted question: how do I choose which duplicate to keep when the one I want is neither the max nor the min?

- David Yang

What does the data in the identifying columns look like? - SO44

1

If the duplicates are actual duplicates, why would the second-to-last one matter? Otherwise they are not duplicates. - Merlin

3

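My understanding is that pandas only considers the 'Person' and 'Question' columns when identifying duplicates, so all the other columns may hold distinguishing values. - Paul H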

2 Answers

5

You could use groupby/tail(2) to take the last 2 items of each group, then groupby/head(1) to take the first of those 2:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just that one item, so head(1) then keeps it.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
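
A smaller, deterministic illustration of the same two steps may make the intermediate result easier to follow. This is only a sketch: the column names come from the question, but the values (and the 'Answer' column) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'Person' and 'Question' follow the question's column
# names; the 'Answer' column and all values are invented for this sketch.
df = pd.DataFrame({
    'Person':   ['ann', 'ann', 'ann', 'bob', 'bob', 'cat'],
    'Question': ['q1',  'q1',  'q1',  'q1',  'q1',  'q1'],
    'Answer':   [10, 20, 30, 40, 50, 60],
})

keys = ['Person', 'Question']

# Step 1: keep at most the last two rows of every group (original order kept).
last_two = df.groupby(keys).tail(2)

# Step 2: of those rows, keep the first one per group.
second_last = last_two.groupby(keys).head(1)

print(second_last)
```

Here 'ann' has three rows, 'bob' two, and 'cat' one, so the rows kept are the ones with original index 1, 3 and 5; the single 'cat' row survives because tail(2) returns it unchanged.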

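Builtin groupby methods (such as tail and head) are often much faster than groupby/apply with a custom Python function. This is especially true if there are a lot of groups: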

In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop

In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop

Alternatively, ayhan suggests a nice improvement:

alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)

In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
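
If this is needed in more than one place, the tail(2)/drop_duplicates combination could be wrapped in a small function. The name keep_second_last below is hypothetical (it is not a pandas API), and the sample data is made up; treat this as a sketch rather than a definitive implementation.

```python
import pandas as pd


def keep_second_last(df, subset):
    """Keep the second-to-last row of each duplicate set defined by `subset`.

    Hypothetical helper, not part of pandas. Sets with a single row keep it.
    """
    # At most the last two rows of every group, in original row order.
    last_two = df.groupby(subset).tail(2)
    # drop_duplicates keeps the first occurrence by default, which is the
    # second-to-last row of the original group (or its only row).
    return last_two.drop_duplicates(subset)


# Made-up sample data for illustration.
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                   'C': range(10)})

result = keep_second_last(df, ['A', 'B'])
print(result)
```

Relying on the default keep='first' only works because tail(2) preserves the original row order, so the earlier of the two surviving rows is always the second-to-last one.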

- unutbu

1

.drop_duplicates(['A', 'B']) is slightly faster than .groupby(['A','B']).head(1). - ayhan

@ayhan: Thanks for the improvement! - unutbu

- ayhan · Accepted Answer

Using groupby.apply:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': np.arange(10), 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  1  3  3
4  2  4  4
5  2  5  5
6  2  6  6
7  3  7  7
8  3  8  8
9  4  9  9

(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
2  1  2  2
5  2  5  5
7  3  7  7
9  4  9  9

With a different DataFrame, using two columns as the subset:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  1  0
1  1  1  1
2  1  2  2
3  1  1  3
4  2  2  4
5  2  2  5
6  2  2  6
7  3  3  7
8  3  3  8
9  4  4  9

(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
1  1  1  1
2  1  2  2
5  2  2  5
7  3  3  7
9  4  4  9
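
The same selection can also be expressed as a boolean mask, without calling a Python function once per group. This is a different technique from the groupby.apply shown above, not something from the original answer: it numbers rows from the end of each group with cumcount(ascending=False) and keeps the row numbered 1, plus any row whose group has size 1. The variable names are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})

grouped = df.groupby(['A', 'B'])

# Position counted from the end of each group: 0 is the last row,
# 1 is the second-to-last row, and so on.
from_end = grouped.cumcount(ascending=False)

# Number of rows in the group each row belongs to.
group_size = grouped['C'].transform('size')

# Keep the second-to-last row, or the only row of a single-row group.
mask = (from_end == 1) | (group_size == 1)
result = df[mask]

print(result)
```

On this DataFrame the mask selects the rows with index 1, 2, 5, 7 and 9, the same rows as the groupby.apply output above.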