I think you can use drop_duplicates.
If you want to check only some columns and keep the first row of each set of duplicates:
newDF = df2.drop_duplicates('student_name')
print(newDF)
student_name test_score
0 Miller 76.0
1 Jacobson 88.0
2 Ali 84.0
3 Milner 67.0
4 Cooze 53.0
5 Jacon 96.0
6 Ryaner 64.0
7 Sone 91.0
8 Sloan 77.0
9 Piger 73.0
10 Riani 52.0
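For anyone who wants to try this without the original data, here is a minimal sketch. The small frame below is a made-up stand-in for df2 (an assumption, since the full input is not shown), and the keep="last" call is only there to contrast with the default keep="first".

```python
import pandas as pd

# Hypothetical stand-in for df2; the real input is not shown in the answer.
df2 = pd.DataFrame({
    "student_name": ["Miller", "Jacobson", "Ali", "Ali"],
    "test_score": [76.0, 88.0, 84.0, None],
})

# Default keep="first": the first row seen for each student_name survives.
first_kept = df2.drop_duplicates("student_name")

# keep="last": the last row seen for each student_name survives instead.
last_kept = df2.drop_duplicates("student_name", keep="last")

print(first_kept)
print(last_kept)
```

Note that the original index labels are preserved, so the result can have gaps in its index.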
And thanks to @cᴏʟᴅsᴘᴇᴇᴅ for another solution:
df2[~df2.student_name.duplicated()]
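As a rough illustration of why the boolean-mask form is equivalent, the sketch below builds the mask explicitly and compares both results. The sample frame is made up for the example and is not the original df2.

```python
import pandas as pd

# Hypothetical sample data, not the original df2.
df2 = pd.DataFrame({
    "student_name": ["Miller", "Ali", "Miller", "Ali"],
    "test_score": [76.0, 84.0, 75.0, 74.0],
})

# duplicated() marks every repeat after the first occurrence as True.
mask = df2["student_name"].duplicated()

# Inverting the mask keeps only the first occurrence of each name.
via_mask = df2[~mask]
via_method = df2.drop_duplicates("student_name")

print(mask.tolist())
print(via_mask.equals(via_method))
```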
But if you want to check all columns together for duplicates, keeping the first row:
newDF = df2.drop_duplicates()
print(newDF)
student_name test_score
0 Miller 76.0
1 Jacobson 88.0
2 Ali 84.0
3 Milner 67.0
4 Cooze 53.0
5 Jacon 96.0
6 Ryaner 64.0
7 Sone 91.0
8 Sloan 77.0
9 Piger 73.0
10 Riani 52.0
11 Ali NaN
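The extra row (Ali, NaN) survives here because it differs from (Ali, 84.0) in test_score. A small sketch of the difference between whole-row and single-column checks, using made-up data rather than the original df2:

```python
import pandas as pd

# Hypothetical sample data, not the original df2.
df2 = pd.DataFrame({
    "student_name": ["Ali", "Ali", "Ali", "Ali"],
    "test_score": [84.0, None, 84.0, None],
})

# Whole-row comparison: (Ali, 84.0) and (Ali, NaN) are different rows,
# so one of each is kept. Missing values are treated as equal to each other.
all_cols = df2.drop_duplicates()

# Single-column comparison: only the very first Ali row is kept.
name_only = df2.drop_duplicates("student_name")

print(all_cols)
print(name_only)
```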
Edit for the new sample - remove duplicates and sort by both columns:
newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
print(newDF)
student_name test_score
2 Ali 74
1 Miller 75
0 Miller 76
Edit 1: If you want to replace duplicates in the first column with NaN:
newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
newDF['student_name'] = newDF['student_name'].mask(newDF['student_name'].duplicated())
print(newDF)
student_name test_score
2 Ali 74
1 Miller 75
0 NaN 76
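Here is a self-contained sketch of the same steps. The input frame is an assumption, reconstructed so that it produces the output shown above with one fully duplicated row added; it is not necessarily the asker's exact data.

```python
import pandas as pd

# Assumed input: chosen to reproduce the output above, plus one exact duplicate row.
df2 = pd.DataFrame({
    "student_name": ["Miller", "Miller", "Ali", "Miller"],
    "test_score": [76, 75, 74, 76],
})

# Drop fully duplicated rows, then sort by name and score.
newDF = df2.drop_duplicates().sort_values(["student_name", "test_score"])

# mask() replaces values with NaN wherever the condition is True,
# here every repeated name after its first occurrence.
newDF["student_name"] = newDF["student_name"].mask(newDF["student_name"].duplicated())

print(newDF)
```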
Edit 2: A more general solution:
newDF = (df2.sort_values(df2.columns.tolist())
            .reset_index(drop=True)
            .apply(lambda x: x.drop_duplicates()))
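To see what this does, here is a small runnable sketch with made-up data (not the original df2). Each column is de-duplicated on its own, and pandas realigns the columns on the index, so positions where a repeat was dropped come back as NaN.

```python
import pandas as pd

# Hypothetical sample data, not the original df2.
df2 = pd.DataFrame({
    "student_name": ["Miller", "Miller", "Ali"],
    "test_score": [76, 75, 74],
})

newDF = (
    df2.sort_values(df2.columns.tolist())    # sort by every column, left to right
       .reset_index(drop=True)               # fresh 0..n-1 index after sorting
       .apply(lambda x: x.drop_duplicates()) # de-duplicate each column separately
)

print(newDF)
```

Because every column is processed independently, repeated values in any column are blanked out, not just those in the first one.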
df2.stack().unique() - cs95
df2 = df2.drop_duplicates() - cs95