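Transpose the dataframe and convert it to a dictionary, so that each row index maps to the list of that row's values: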
res = df.T.to_dict("list")
res
{0: [1, 2, 1.0, 'a'],
1: [1, 2, 1.01, 'a'],
2: [1, 2, 1.0, 'b'],
3: [3, 4, 0.0, 'b'],
4: [3, 4, 0.0, 'c'],
5: [1, 2, 1.0, 'c'],
6: [1, 9, 1.0, 'c']}
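The dataframe itself is not defined in this part of the answer. Below is a minimal sketch that rebuilds one consistent with the dictionary printed above. It assumes the columns are named A, B, C and D, which is what the merge step further down suggests; the exact original definition may differ.

```python
import pandas as pd

# Assumed reconstruction of df from the printed output; the column names
# A-D are inferred from the merge step, not given explicitly here.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 3, 3, 1, 1],
        "B": [2, 2, 2, 4, 4, 2, 9],
        "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
        "D": ["a", "a", "b", "b", "c", "c", "c"],
    }
)

# Transposing first turns each row index into a key and its values into a list.
res = df.T.to_dict("list")
```

With this frame, res[0] holds the values of the first row and res[6] those of the last.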
Flatten each index and its row values into a single tuple:
box = [(key,*value) for key, value in res.items()]
box
[(0, 1, 2, 1.0, 'a'),
(1, 1, 2, 1.01, 'a'),
(2, 1, 2, 1.0, 'b'),
(3, 3, 4, 0.0, 'b'),
(4, 3, 4, 0.0, 'c'),
(5, 1, 2, 1.0, 'c'),
(6, 1, 9, 1.0, 'c')]
Use itertools.permutations together with your conditions to filter the matching pairs:
from itertools import permutations
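# Regroup each row as (index, (A, B), C, D) so the (A, B) pair compares as one unit.
phase1 = [(ind, (first, second), *_) for ind, first, second, *_ in box]
# Keep ordered pairs with equal (A, B), C within 0.05 of each other, and different D.
# abs() makes the tolerance symmetric; a plain difference would also pass large negative gaps.
phase2 = [((*first[1], *first[2:]), second[0])
          for first, second in permutations(phase1, 2)
          if first[1] == second[1] and abs(second[2] - first[2]) <= 0.05 and first[-1] != second[-1]
          ]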
phase2
[((1, 2, 1.0, 'a'), 2),
((1, 2, 1.0, 'a'), 5),
((1, 2, 1.01, 'a'), 2),
((1, 2, 1.01, 'a'), 5),
((1, 2, 1.0, 'b'), 0),
((1, 2, 1.0, 'b'), 1),
((1, 2, 1.0, 'b'), 5),
((3, 4, 0.0, 'b'), 4),
((3, 4, 0.0, 'c'), 3),
((1, 2, 1.0, 'c'), 0),
((1, 2, 1.0, 'c'), 1),
((1, 2, 1.0, 'c'), 2)]
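The three conditions in the comprehension can also be read as a standalone predicate. The sketch below is illustrative only: the name is_match and the tol parameter are hypothetical, and it assumes the tolerance on the third value is meant to be symmetric, hence abs().

```python
def is_match(first, second, tol=0.05):
    """Return True when two phase1-style tuples count as near-duplicates.

    Each tuple is assumed to look like (index, (A, B), C, D).
    """
    same_key = first[1] == second[1]            # same (A, B) pair
    close = abs(second[2] - first[2]) <= tol    # C values within the tolerance
    different_label = first[-1] != second[-1]   # D values must differ
    return same_key and close and different_label


row0 = (0, (1, 2), 1.0, "a")
row1 = (1, (1, 2), 1.01, "a")
row2 = (2, (1, 2), 1.0, "b")
row6 = (6, (1, 9), 1.0, "c")

print(is_match(row0, row2))  # True
print(is_match(row0, row1))  # False: same D
print(is_match(row2, row6))  # False: different (A, B)
```

Pulling the test out this way makes it easier to change the tolerance or add a condition without touching the comprehension.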
Group the matched indices per row with a defaultdict:
from collections import defaultdict
d = defaultdict(list)
for k, v in phase2:
    d[k].append(v)
d
defaultdict(list,
{(1, 2, 1.0, 'a'): [2, 5],
(1, 2, 1.01, 'a'): [2, 5],
(1, 2, 1.0, 'b'): [0, 1, 5],
(3, 4, 0.0, 'b'): [4],
(3, 4, 0.0, 'c'): [3],
(1, 2, 1.0, 'c'): [0, 1, 2]})
Join the values in d into comma-separated strings:
e = [(*k,",".join(str(ent) for ent in v)) for k,v in d.items()]
e
[(1, 2, 1.0, 'a', '2,5'),
(1, 2, 1.01, 'a', '2,5'),
(1, 2, 1.0, 'b', '0,1,5'),
(3, 4, 0.0, 'b', '4'),
(3, 4, 0.0, 'c', '3'),
(1, 2, 1.0, 'c', '0,1,2')]
Create a dataframe from the extracted tuples:
cols = df.columns.append(pd.Index(["Dups"]))
dups = pd.DataFrame(e, columns=cols)
Merge with the original dataframe:
result = df.merge(dups, how="left", on=["A", "B", "C", "D"])
result
   A  B     C  D   Dups
0  1  2  1.00  a    2,5
1  1  2  1.01  a    2,5
2  1  2  1.00  b  0,1,5
3  3  4  0.00  b      4
4  3  4  0.00  c      3
5  1  2  1.00  c  0,1,2
6  1  9  1.00  c    NaN
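As a cross-check, the same result can probably be reached without leaving pandas, by merging the frame with itself. This is a different technique from the steps above, not a restatement of them. It assumes the column names A to D and a symmetric 0.05 tolerance on C; the _other suffix is an arbitrary choice.

```python
import pandas as pd

# Assumed sample data, matching the output shown above.
df = pd.DataFrame(
    {
        "A": [1, 1, 1, 3, 3, 1, 1],
        "B": [2, 2, 2, 4, 4, 2, 9],
        "C": [1.0, 1.01, 1.0, 0.0, 0.0, 1.0, 1.0],
        "D": ["a", "a", "b", "b", "c", "c", "c"],
    }
)

# Pair every row with every row that shares the same (A, B).
left = df.reset_index()
pairs = left.merge(left, on=["A", "B"], suffixes=("", "_other"))

# Keep pairs whose C values are within 0.05 and whose D values differ.
# Rows paired with themselves drop out because their D values are equal.
mask = ((pairs["C"] - pairs["C_other"]).abs() <= 0.05) & (pairs["D"] != pairs["D_other"])
matched = pairs[mask]

# Collect the matching row indices per original row as a comma-separated string.
dups = (
    matched.sort_values(["index", "index_other"])
    .groupby("index")["index_other"]
    .agg(lambda s: ",".join(map(str, s)))
)

# Assignment aligns on the index, so rows without a match get NaN.
result = df.assign(Dups=dups)
print(result)
```

The self-merge builds every pair within an (A, B) group, so memory use grows quadratically with group size, much like permutations does in the pure-Python version.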