Pandas: add a duplicate indicator across columns

3

Here is a Pandas DataFrame with columns A, B, C, and D:

      A B  C    D   
    0 1 2 1.0   a   
    1 1 2 1.01  a   
    2 1 2 1.0   b   
    3 3 4 0     b   
    4 3 4 0     c   
    5 1 2 1     c   
    6 1 9 1     c   

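How can I add a column that lists, for each row, the other rows that count as duplicates under these constraints:

  • exact match on A and B
  • C within a float tolerance (within 0.05)
  • D must not match

      A B  C    D   Dups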
    0 1 2 1.0   a   2,5
    1 1 2 1.01  a   2,5
    2 1 2 1.0   b   0,1,5
    3 3 4 0     b   4
    4 3 4 0     c   3
    5 1 2 1     c   0,1,2
    6 1 9 1     c   null
3 Answers

1
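My original answer needs N**2 iterations for N rows. Sammywemmy's answer loops over permutations(..., 2), which is effectively a loop over N*(N-1) ordered pairs. Warped's answer is more efficient because it starts with a fast match on columns A and B, but it still does a slow search over C and D. Its iteration count is therefore N*M, where M is the average number of rows sharing the same A and B values.
If you are willing to change the requirement from "C equal within +/-0.05" to "C equal after rounding", you can do better: the iteration count becomes N*K, where K is the average number of rows with the same A, B, and rounded C values. Here is one way to implement it; you could also adapt warped's approach.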
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})

# alternative to "equal +/- 0.05"
df['C10'] = np.around(df['C']*10).astype('int')

# convert int64 tuples to int tuples
ituple = lambda tup: tuple(int(x) for x in tup)

# records: [(1, 2, 10), (1, 2, 10), (1, 2, 10), (3, 4, 0), ...]
records = [ituple(rec) for rec in df[['A', 'B', 'C10']].to_records(index=False)]

# dupd: dict with records as key, list of indices as values.
# e.g. {(1, 2, 10): [0, 1, 2, 5], ...}
dupd = {} # key: ABC tuples; value: list of indices

# Build up dupd based on equal A, B, C columns.
for i, rec in enumerate(records):
    # each record is a tuple with integers; can be used as key in dict
    if rec in dupd:
        dupd[rec].append(i)
    else:
        dupd[rec] = [i]
        
# build duplicates for each row, remove the ones with equal D
dups = []
D = df['D']
for i, rec in enumerate(records):
    dup = [j for j in dupd[rec] if i!=j and D[i] != D[j]]
    dups.append(tuple(dup))
    
df.drop(columns=['C10'], inplace=True)
df['Dups'] = dups
        
print(df)

Output:

   A  B     C  D       Dups
0  1  2  1.00  a     (2, 5)
1  1  2  1.01  a     (2, 5)
2  1  2  1.00  b  (0, 1, 5)
3  3  4  0.00  b       (4,)
4  3  4  0.00  c       (3,)
5  1  2  1.00  c  (0, 1, 2)
6  1  9  1.00  c         ()
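
One caveat with the rounding trick is that it is not the same test as a +/-0.05 tolerance. The sketch below compares the two on a few hand-picked values. The helper names `within_tol` and `same_bucket` are made up for illustration, and the built-in `round` stands in for `np.around` to keep the example dependency-free.

```python
def within_tol(x, y, tol=0.05):
    """Tolerance test from the question: |x - y| <= tol."""
    return abs(x - y) <= tol


def same_bucket(x, y, scale=10):
    """Rounding test from this answer: equal after rounding x*scale."""
    return round(x * scale) == round(y * scale)


# Close values on opposite sides of a rounding boundary land in different buckets.
print(within_tol(1.04, 1.06), same_bucket(1.04, 1.06))

# Values in the same bucket can still be more than 0.05 apart.
print(within_tol(0.96, 1.04), same_bucket(0.96, 1.04))

# For the example data the two tests agree.
print(within_tol(1.0, 1.01), same_bucket(1.0, 1.01))
```

Whether this difference matters depends on how the C values are distributed; for data like the sample frame, where near-duplicates sit well inside one bucket, the rounded key gives the same result.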

Below is the original answer. It is O(N**2), but easy to follow:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 3, 4: 3, 5: 1, 6: 1},
     'B': {0: 2, 1: 2, 2: 2, 3: 4, 4: 4, 5: 2, 6: 9},
     'C': {0: 1.0, 1: 1.01, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0},
     'D': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c', 6: 'c'}})


dups = []
for i, irow in df.iterrows():
    dup = []
    for j, jrow in df.iterrows():
        if (i != j and 
            irow['A'] == jrow['A'] and
            irow['B'] == jrow['B'] and 
            abs(irow['C']-jrow['C']) < 0.05 and
            irow['D'] != jrow['D']
            ):
            dup.append(j)
    dups.append(tuple(dup))
df['Dups'] = dups

print(df)
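
For small frames, the same pairwise test could also be written with NumPy broadcasting instead of nested `iterrows` loops. This is only a sketch: it still does O(N**2) work and builds N x N boolean arrays, and it assumes the default 0..N-1 index so that array positions equal row labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

A = df["A"].to_numpy()
B = df["B"].to_numpy()
C = df["C"].to_numpy()
D = df["D"].to_numpy()

# Each mask is N x N; entry [i, j] compares row i with row j.
same_ab = (A[:, None] == A[None, :]) & (B[:, None] == B[None, :])
close_c = np.abs(C[:, None] - C[None, :]) < 0.05
diff_d = D[:, None] != D[None, :]

mask = same_ab & close_c & diff_d
np.fill_diagonal(mask, False)  # a row is never its own duplicate

# Positions of True entries in each row; equal to labels only for a default index.
dups = [tuple(np.flatnonzero(row).tolist()) for row in mask]
df["Dups"] = dups
print(df)
```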

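I don't mind it being less efficient; I just want to add a column to help with some research... The data is generated a few times a week, and it may take a while to generate. I'm not especially familiar with pandas. Thanks! - RML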

0

It isn't pretty, but it gets the job done:

tolerance=0.05

dups={}
for _, group in df.groupby(['A', 'B']):
    for i, row1 in group.iterrows():
        
        data = []
        
        for j, row2 in group.iterrows():
            if i!=j:
                if abs(row1['C'] - row2['C']) <= tolerance:
                    if row1['D'] != row2['D']:
                        print(i,j)
                        data.append(j)
        
        dups[i] = data
        
dups = [dups.get(a) for a in range(len(dups.keys()))]
df['dups'] = dups

df

    A   B   C       D   dups
0   1   2   1.00    a   [2, 5]
1   1   2   1.01    a   [2, 5]
2   1   2   1.00    b   [0, 1, 5]
3   3   4   0.00    b   [4]
4   3   4   0.00    c   [3]
5   1   2   1.00    c   [0, 1, 2]
6   1   9   1.00    c   []
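
The group-then-compare idea can also be expressed as a self-merge on A and B, which moves the pairing work into pandas. The following is a sketch under a few assumptions: the column name `idx` and the suffixes `_l` and `_r` are chosen here for illustration, and rows without matches end up as NaN rather than an empty list.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})
tolerance = 0.05

# Carry the row label along as an ordinary column.
left = df.rename_axis("idx").reset_index()

# Pair every row with every row that shares the same A and B.
pairs = left.merge(left, on=["A", "B"], suffixes=("_l", "_r"))

keep = (
    (pairs["idx_l"] != pairs["idx_r"])                      # not the row itself
    & ((pairs["C_l"] - pairs["C_r"]).abs() <= tolerance)    # C within tolerance
    & (pairs["D_l"] != pairs["D_r"])                        # D must differ
)

matches = (
    pairs.loc[keep]
    .sort_values(["idx_l", "idx_r"])
    .groupby("idx_l")["idx_r"]
    .agg(list)
)

out = df.copy()
out["Dups"] = matches.reindex(df.index)  # rows with no match become NaN
print(out)
```

The merge materializes one row per candidate pair, so memory grows with the size of the largest (A, B) group; for very large groups the explicit loop may still be the safer choice.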

0

Convert to a dictionary:

res = df.T.to_dict("list")
res

{0: [1, 2, 1.0, 'a'],
 1: [1, 2, 1.01, 'a'],
 2: [1, 2, 1.0, 'b'],
 3: [3, 4, 0.0, 'b'],
 4: [3, 4, 0.0, 'c'],
 5: [1, 2, 1.0, 'c'],
 6: [1, 9, 1.0, 'c']}

Pair each index with its row values in a tuple:

box = [(key,*value) for key, value in res.items()]
box

[(0, 1, 2, 1.0, 'a'),
 (1, 1, 2, 1.01, 'a'),
 (2, 1, 2, 1.0, 'b'),
 (3, 3, 4, 0.0, 'b'),
 (4, 3, 4, 0.0, 'c'),
 (5, 1, 2, 1.0, 'c'),
 (6, 1, 9, 1.0, 'c')]
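
As a side note, the same list of (index, values...) tuples can likely be obtained in one step with `DataFrame.itertuples`, skipping the transpose and the intermediate dictionary. A minimal sketch on the example frame:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

# name=None yields plain tuples; index=True puts the row label first.
box = list(df.itertuples(index=True, name=None))
print(box[1])
```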

Use itertools permutations together with your conditions to filter the matches:

from itertools import permutations
phase1 = [(ind, (first, second),*_) for ind, first, second, *_ in box]

#can be refactored with something cleaner
phase2 = [((*first[1],*first[2:]), second[0]) 
          for first, second in permutations(phase1,2) 
          if first[1] == second[1] and abs(second[2] - first[2]) <= 0.05 and first[-1] != second[-1]
         ]
phase2

[((1, 2, 1.0, 'a'), 2),
 ((1, 2, 1.0, 'a'), 5),
 ((1, 2, 1.01, 'a'), 2),
 ((1, 2, 1.01, 'a'), 5),
 ((1, 2, 1.0, 'b'), 0),
 ((1, 2, 1.0, 'b'), 1),
 ((1, 2, 1.0, 'b'), 5),
 ((3, 4, 0.0, 'b'), 4),
 ((3, 4, 0.0, 'c'), 3),
 ((1, 2, 1.0, 'c'), 0),
 ((1, 2, 1.0, 'c'), 1),
 ((1, 2, 1.0, 'c'), 2)]
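
Two properties of `permutations(..., 2)` are worth keeping in mind here: it yields every ordered pair, so the work grows as N*(N-1), and each pair of rows is visited in both directions. The sketch below illustrates both points; the values in `c` are hypothetical and not taken from the data above.

```python
from itertools import permutations

n = 7  # number of rows in the example frame
pairs = list(permutations(range(n), 2))
print(len(pairs), n * (n - 1))

# Every pair of rows shows up once in each direction.
print((0, 2) in pairs, (2, 0) in pairs)

# Hypothetical C values that are far apart.
c = {0: 1.0, 2: 0.5}
both_directions = [(0, 2), (2, 0)]

# A signed difference depends on the direction of the pair ...
signed = [(i, j) for i, j in both_directions if c[j] - c[i] <= 0.05]
# ... while an absolute difference treats both directions the same.
symmetric = [(i, j) for i, j in both_directions if abs(c[j] - c[i]) <= 0.05]
print(signed, symmetric)
```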

Collect the pairs with a defaultdict:

from collections import defaultdict

d = defaultdict(list)
for k, v in phase2:
    d[k].append(v)

d


defaultdict(list,
            {(1, 2, 1.0, 'a'): [2, 5],
             (1, 2, 1.01, 'a'): [2, 5],
             (1, 2, 1.0, 'b'): [0, 1, 5],
             (3, 4, 0.0, 'b'): [4],
             (3, 4, 0.0, 'c'): [3],
             (1, 2, 1.0, 'c'): [0, 1, 2]})

Join the values in d into strings:

e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e

[(1, 2, 1.0, 'a', '2,5'),
 (1, 2, 1.01, 'a', '2,5'),
 (1, 2, 1.0, 'b', '0,1,5'),
 (3, 4, 0.0, 'b', '4'),
 (3, 4, 0.0, 'c', '3'),
 (1, 2, 1.0, 'c', '0,1,2')]

Create a DataFrame from the extracted tuples:

cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)

Merge with the original DataFrame:

result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result

    A   B   C       D   Dups
0   1   2   1.00    a   2,5
1   1   2   1.01    a   2,5
2   1   2   1.00    b   0,1,5
3   3   4   0.00    b   4
4   3   4   0.00    c   3
5   1   2   1.00    c   0,1,2
6   1   9   1.00    c   NaN
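
One thing to watch in this approach is that the dictionary is keyed by the row values (A, B, C, D), and the final merge uses those same values as its key, so two rows with identical values would share a single entry. A possible variant, sketched here under the assumption that keying by index label is acceptable, collects the matches per index and skips the merge. The names `by_index` and `labels` are illustrative.

```python
from collections import defaultdict
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3, 3, 1, 1],
    "B": [2, 2, 2, 4, 4, 2, 9],
    "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
    "D": ["a", "a", "b", "b", "c", "c", "c"],
})

rows = list(df.itertuples(index=True, name=None))  # (idx, A, B, C, D)

by_index = defaultdict(list)
for (i, a1, b1, c1, d1), (j, a2, b2, c2, d2) in permutations(rows, 2):
    if (a1, b1) == (a2, b2) and abs(c1 - c2) <= 0.05 and d1 != d2:
        by_index[i].append(j)

# Join each list into the comma-separated form used in the question.
labels = {i: ",".join(str(j) for j in js) for i, js in by_index.items()}

result = df.copy()
result["Dups"] = pd.Series(labels).reindex(df.index)  # unmatched rows become NaN
print(result)
```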

Content provided by Stack Overflow.