Find all duplicate rows in a pandas DataFrame

Question

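I would like to get the indices of all duplicated rows in a dataset, without knowing the names or the number of columns in advance. Suppose I have the following data:

I would like to get [1, 3, 4] and [2, 5]. Is there a way to do this? It sounds simple, but since I don't know the column names beforehand, I can't write something like df[col == x...].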

- Nico

1 Answer

- jezrael · Accepted Answer

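First keep only the rows whose value is duplicated (duplicated with keep=False marks every member of a duplicate group, not just the repeats). Then either use groupby with apply, or convert the index to a Series with to_series and group that: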

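# Keep every row whose value in 'col' occurs more than once
df = df[df.col.duplicated(keep=False)]

# Collect the index labels of each group of duplicates
a = df.groupby('col').apply(lambda x: list(x.index))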
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object

a = df.index.to_series().groupby(df.col).apply(list)
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
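
To try the to_series approach end to end, here is a minimal self-contained sketch. The sample frame is an assumption, since the question's data is not shown; the index labels and the column name 'col' were chosen only so that the result matches the output above.

```python
import pandas as pd

# Hypothetical sample data (not from the question): index labels 1..6,
# a single column named 'col'.
df = pd.DataFrame({'col': [1, 2, 1, 1, 2, 3]}, index=[1, 2, 3, 4, 5, 6])

# keep=False flags every row that belongs to a duplicate group
dups = df[df['col'].duplicated(keep=False)]

# Group the index labels by the column values and collect them into lists
groups = dups.index.to_series().groupby(dups['col']).apply(list)
print(groups)
```

The row with index 6 has a unique value, so it is dropped by the filter and does not appear in any group.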

If you need a nested list:

L = df.groupby('col').apply(lambda x: list(x.index)).tolist()
print (L)
[[1, 3, 4], [2, 5]]

If you only need the first column, you can select it by position with iloc:

# Parentheses are required so the chained call can span two lines
a = (df[df.iloc[:, 0].duplicated(keep=False)]
       .groupby(df.iloc[:, 0]).apply(lambda x: list(x.index)))
print (a)
col
1    [1, 3, 4]
2       [2, 5]
dtype: object
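
The snippets above key on a single column. Since the question asks about whole rows with unknown column names, one possible extension is to call duplicated with no subset, which compares entire rows, and then group by every column. This is a sketch under assumptions, not part of the original answer: the two-column frame and the names 'a' and 'b' are hypothetical.

```python
import pandas as pd

# Hypothetical two-column frame; the names 'a' and 'b' are illustrative only
df = pd.DataFrame(
    {'a': [1, 2, 1, 1, 2, 3], 'b': ['x', 'y', 'x', 'x', 'y', 'z']},
    index=[1, 2, 3, 4, 5, 6],
)

# With no subset argument, duplicated() compares all columns of each row
dups = df[df.duplicated(keep=False)]

# Build the grouping keys from whatever columns exist, without naming them
keys = [dups[c] for c in dups.columns]
L = dups.index.to_series().groupby(keys).apply(list).tolist()
print(L)
```

Note that groupby drops keys containing NaN by default, so duplicated rows with missing values would need dropna=False to be kept.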