Pandas - Find rows where two columns have matching values and multiply the values in another column

Question

3

First, suppose we have the following DataFrame:

import pandas as pd
data = pd.DataFrame({'id':['1','2','3','4','5','6','7','8'], 
                     'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],  
                     'C':['10','10','10','30','50','60','50','8'], 
                     'D':['9','8','7','6','5','4','3','2']})
print(data)

    A   C   D   id
0   foo 10  9   1
1   bar 10  8   2
2   foo 10  7   3
3   bar 30  6   4
4   foo 50  5   5
5   bar 60  4   6
6   foo 50  3   7
7   foo 8   2   8

What I want to do is find the matching rows and perform a calculation on them.

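from itertools import combinations

# Compare every pair of rows; a pair matches when both A and C are equal
for (_, idx), (_, idy) in combinations(data.iterrows(), 2):
    if idx.A == idy.A and idx.C == idy.C:
        result = int(idx.D) * int(idy.D)  # D holds strings, so convert first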

Then generate a new DataFrame with three columns: ['id'], ['A'] and ['result'].

So a few rows of the expected result are:

     id   A   result   
0    1   foo   63   
1    3   foo   63   
2    5   foo   15
3    7   foo   15

I have tried, but I end up with either the wrong logic or the wrong code/data format. Can anyone help me?

- Alex12346

Does my answer work for you? - ababuji

3 Answers

1

You can use the self-join technique:

data[['id', 'C', 'D']] = data[['id', 'C', 'D']].apply(pd.to_numeric)
joint = pd.merge(data, data, on=('A', 'C'))
joint = joint.loc[joint['id_x'] != joint['id_y']]
joint['result'] = joint['D_x'] * joint['D_y']
result = joint[['id_x', 'A', 'result']]
result.columns = ['id', 'A', 'result']

Result:

   id    A  result
1   1  foo      63
2   3  foo      63
7   5  foo      15
8   7  foo      15

- Lev Zakharov

1

A better and faster way is joint['result'] = joint['D_x'] * joint['D_y'], rather than joint['result'] = joint.apply(lambda x: x['D_x'] * x['D_y'], axis=1). - jezrael
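
To illustrate the comment above, here is a minimal sketch comparing the two forms. The small `joint` frame is a hypothetical stand-in for the self-joined data, not output from the thread; both expressions should produce the same values, but the vectorized one avoids calling a Python function once per row.

```python
import pandas as pd

# Hypothetical stand-in for the self-joined frame (illustrative values only)
joint = pd.DataFrame({'D_x': [9, 7, 5, 3], 'D_y': [7, 9, 3, 5]})

# Vectorized: a single column-wise multiplication
vectorized = joint['D_x'] * joint['D_y']

# Row-wise: the lambda is called once for every row
row_wise = joint.apply(lambda x: x['D_x'] * x['D_y'], axis=1)

# Both approaches yield the same products
print(vectorized.tolist() == row_wise.tolist())
```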

You could probably also write it as: joint = pd.merge(data, data, on=('A', 'C'))[lambda r: r.id_x != r.id_y], but that might be slower... - Jon Clements
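
A small sketch of the callable-indexing form mentioned above, on a reduced version of the question's data with the columns already numeric (an assumption made to keep the example short). Passing a callable to `[]` makes pandas call it with the merged frame and use the returned boolean mask as the filter.

```python
import pandas as pd

# Reduced, already-numeric version of the question's data (assumption)
data = pd.DataFrame({'id': [1, 3, 5, 7],
                     'A': ['foo', 'foo', 'foo', 'foo'],
                     'C': [10, 10, 50, 50],
                     'D': [9, 7, 5, 3]})

# Self-join on A and C, then drop the rows where an id is paired with itself
joint = pd.merge(data, data, on=['A', 'C'])[lambda r: r.id_x != r.id_y]
print(joint)
```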

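I get a "'function' object is not subscriptable" error and cannot run it. - user8864088

Sir, when I try your answer on a large dataset, it causes a memory error. Do you have any solution? @Lev Zakharov - Alex12346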
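
One possible way around the memory error, sketched here as an assumption rather than as the answerer's method, is to skip the self-join and use `groupby(...).transform`, which returns one value per original row instead of one row per pair. Note the difference in meaning: when a group has three or more rows, this gives the product of all `D` values in the group, not one product per pair.

```python
import pandas as pd

data = pd.DataFrame({'id': ['1', '2', '3', '4', '5', '6', '7', '8'],
                     'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'C': ['10', '10', '10', '30', '50', '60', '50', '8'],
                     'D': ['9', '8', '7', '6', '5', '4', '3', '2']})
data[['id', 'C', 'D']] = data[['id', 'C', 'D']].apply(pd.to_numeric)

grouped = data.groupby(['A', 'C'])['D']
# Per-row group product and group size, aligned to the original index
group_product = grouped.transform('prod')
group_size = grouped.transform('count')

# Keep only rows whose (A, C) group has at least one other member
result = (data.assign(result=group_product)
              .loc[group_size > 1, ['id', 'A', 'result']])
print(result)
```

Because no pairwise frame is built, memory use grows with the number of rows rather than with the number of matching pairs.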

0

import pandas as pd
data = pd.DataFrame({'id':['1','2','3','4','5','6','7','8'], 
                     'A':['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],  
                     'C':['10','10','10','30','50','60','50','8'], 
                     'D':['9','8','7','6','5','4','3','2']})

First, convert the relevant columns to numeric format:

data[['C', 'D', 'id']] = data[['C', 'D', 'id']].apply(pd.to_numeric)

Create an empty DataFrame to append the data to.

finalDataFrame = pd.DataFrame()

Group by the two columns, then find the product of column D within each group and append it.

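group = data.groupby(['A', 'C'])
for x, y in group:
    # Product of column D within this (A, C) group
    product = y['D'].prod()

    # Overwrite D with the group product on a copy of the group's rows
    y = y.assign(D=product)

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    finalDataFrame = pd.concat([finalDataFrame, y], ignore_index=True)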

output = finalDataFrame[['id', 'A', 'D']]
output = output.rename(columns = {'D': 'result'})
print(output)

This gives you:

   id    A  result
0   2  bar       8
1   4  bar       6
2   6  bar       4
3   8  foo       2
4   1  foo      63
5   3  foo      63
6   5  foo      15
7   7  foo      15

- ababuji

- Jon Clements · Accepted Answer

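One approach is to group by A + C, take the product and the count, filter out the groups that contain only one item, and then inner merge back onto the original DataFrame on A + C, e.g.: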

df.merge(
    df.groupby(['A', 'C']).D.agg(['prod', 'count'])
    [lambda r: r['count'] > 1],
    left_on=['A', 'C'],
    right_index=True
)

This gives you:

     A   C  D  id  prod  count
0  foo  10  9   1    63      2
2  foo  10  7   3    63      2
4  foo  50  5   5    15      2
6  foo  50  3   7    15      2

Then drop/rename the columns as needed.
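
For completeness, here is a sketch of that last step, assuming the columns have already been converted to numbers and that the desired output is the question's three columns. The names `merged` and `result` are chosen here for illustration.

```python
import pandas as pd

# The question's data with numeric columns (assumed already converted)
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'C': [10, 10, 10, 30, 50, 60, 50, 8],
                   'D': [9, 8, 7, 6, 5, 4, 3, 2]})

# Per-group product and count, keeping only groups with more than one row
stats = df.groupby(['A', 'C']).D.agg(['prod', 'count'])[lambda r: r['count'] > 1]

merged = df.merge(stats, left_on=['A', 'C'], right_index=True)

# Drop the helper columns and rename 'prod' to the requested 'result'
result = (merged.drop(columns=['C', 'D', 'count'])
                .rename(columns={'prod': 'result'}))
print(result)
```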