How do I drop duplicate values based on specific columns using pandas?

Question


3

I have imported the following data frame from Excel into pandas, and I want to drop duplicate rows based on the values of two columns.

# Python 3.5.2
# Pandas library version 0.22

import pandas as pd 

# Save the Excel workbook in a variable
current_workbook  = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')

# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num') 

# current output
print(current_worksheet)


| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| CHARLS      | Charlie & Associates | $5,700.00 | South   |
| CHARLS      | Charlie & Associates | $5,700.00 | North   |
| CHARLS      | Charlie & Associates | $5,700.00 | West    |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | North   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
| ....        |                      |           |         |

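What I want is to drop duplicate rows based on the quantity and source columns:

Check the values of the quantity and source columns:

1.1 If a vendor's quantity equals the quantity in another row for the same vendor, and one of those rows has the source Central, drop the repeated rows for that vendor and keep only the Central row.

1.2 Otherwise, if a vendor's quantity equals the quantity in another row for the same vendor and none of those rows has the source Central, drop the repeated rows.

Desired result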

| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
| ....        |                      |           |         |

So far I have tried the following code, but pandas does not detect any duplicate rows.

print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
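
A likely reason the check above finds nothing: with no arguments, `duplicated()` compares every column, and the source column differs on each of these rows. The sketch below rebuilds the sample rows by hand (an assumption, since the real workbook is not available) and restricts the comparison with the `subset` parameter.

```python
import pandas as pd

# Sample rows retyped from the question's table; the real sheet may differ
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# Default behaviour: all columns are compared, so no row counts as a duplicate
full_row_dupes = df.duplicated()

# Comparing only vendor and quantity marks every repeat after the first
partial_dupes = df.duplicated(subset=['vend_number', 'quantity'], keep='first')

print(full_row_dupes.any())
print(df[partial_dupes])
```

This only shows which rows repeat; it does not yet give Central rows priority.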

I have tried to work this out but I am struggling with it, so any help is much appreciated. Feel free to improve this question.

- abautista

For Fernanda Industries at $3,500, how do you choose between South and North? - jpp

1

Take the first row encountered, which in this case is South. - abautista

2 Answers

1

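You can do it in two steps.

# Step 1: rows whose source is Central are always kept
s = df.loc[df['source'] == 'Central', :]
# Step 2: rows of vendors that have no Central row at all
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]

# Among those, keep the first row per (vend_number, quantity) pair
result = pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])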
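
For anyone who wants to run the two steps end to end, here is a minimal self-contained sketch. The data frame is retyped from the question's table (an assumption about the real sheet), and the names `central`, `others`, and `result` are illustrative only.

```python
import pandas as pd

# Sample rows retyped from the question's table; the real sheet may differ
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# Step 1: every Central row is kept as-is
central = df.loc[df['source'] == 'Central']

# Step 2: vendors without any Central row, deduplicated on vendor and quantity
others = df.loc[~df['vend_number'].isin(central['vend_number'])]
others = others.drop_duplicates(['vend_number', 'quantity'], keep='first')

result = pd.concat([central, others])
print(result)
```

One caveat: a vendor that has a Central row loses all of its other rows here, including any whose quantity differs from the Central row's. That matches the sample data, but it may not be what is wanted in general.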

- BENY


- jpp · Accepted Answer

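Here is one way.

df['CentralFlag'] = (df['source'] == 'Central')

# kind='mergesort' is a stable sort, so ties keep their original row order
df = df.sort_values('CentralFlag', ascending=False, kind='mergesort')\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop('CentralFlag', axis=1)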

#   vend_number           vend_name   quantity   source
# 0      CHARLS  Charlie&Associates  $5,700.00  Central
# 4      HUGHES            Hughinos  $3,800.00  Central
# 6      FERNAS  FernandaIndustries  $3,500.00    South
# 8      FERNAS  FernandaIndustries  $3,000.00     West

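Explanation

Create a flag column and sort by it in descending order, so that Central rows take priority.
Drop duplicates on vend_name and quantity, which keeps the first row of each group, then drop the flag column.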
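
The same idea can be wrapped in a small function that leaves the input untouched. This is only a sketch: `drop_dupes_prefer_central` is a hypothetical helper name, and the sample rows are rearranged so that a Central row comes after a non-Central one, to show that the flag rather than the row position decides which row survives. The stable `mergesort` is used so that, among non-Central ties, the first row encountered is kept.

```python
import pandas as pd

def drop_dupes_prefer_central(frame, keys=('vend_name', 'quantity')):
    """Hypothetical helper: keep one row per key combination, preferring Central."""
    # assign() returns a copy, so the caller's frame is not modified
    flagged = frame.assign(CentralFlag=frame['source'].eq('Central'))
    # Stable sort: Central rows first, ties stay in their original order
    ordered = flagged.sort_values('CentralFlag', ascending=False, kind='mergesort')
    deduped = ordered.drop_duplicates(list(keys))
    # Remove the helper column and restore the original row order
    return deduped.drop(columns='CentralFlag').sort_index()

# Illustrative rows (not the asker's exact data): Central is listed second
df = pd.DataFrame({
    'vend_number': ['CHARLS', 'CHARLS', 'FERNAS', 'FERNAS', 'FERNAS'],
    'vend_name': ['Charlie & Associates'] * 2 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00', '$5,700.00', '$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['South', 'Central', 'South', 'North', 'West'],
})

result = drop_dupes_prefer_central(df)
print(result)
```

Unlike filtering out whole vendors, this keeps a non-Central row whenever its quantity differs from the vendor's Central row.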