Grouping unique values by highest occurrence count using Python

Below is a sample of my df.
name
A S BITO 
A S KIGEL 
A S NATURENERGI
A S NATURENERGIE 
A S NATURENERGIE 
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER PETER GMBH 
A S P GMBH  
A RESE LAND
A RITTER WITH SA
A RITTER WITH SA    
A RITTER WITH SA
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER WITH MASCHINE
A RITTER WITH MASCHINE SA CO 
A RITTER WITH MASCHINE SA CO 

The goal is to replace each name with the unique value that occurs most often.
Below is the list of unique values.
name                                 occurrences
A S BITO                             1
A S KIGEL                            1
A S NATURENERGI                      1
A S NATURENERGIE                     2
A S P BU SERVICE POWER P             2 
A S P BU SERVICE POWER PETER GMBH    1
A S P GMBH                           1
A RESE LAND                          1
A RITTER WITH SA                     3
A RITTER SA CO                       4
A RITTER WITH MASCHINE               1
A RITTER WITH MASCHINE SA CO         2
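
A table like the one above can be produced with `value_counts`. The snippet below is a minimal sketch on a few of the sample rows; it assumes that the trailing spaces in the sample are noise and strips them before counting, which the question does not state explicitly.

```python
import pandas as pd

# A few rows from the sample, including the trailing whitespace seen in the data
df = pd.DataFrame({"name": ["A S BITO ",
                            "A S NATURENERGI",
                            "A S NATURENERGIE ",
                            "A S NATURENERGIE ",
                            "A RITTER SA CO  ",
                            "A RITTER SA CO"]})

# Assumption: surrounding whitespace is not meaningful, so strip it before counting
counts = df["name"].str.strip().value_counts()
print(counts)
```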

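As the DF shows, some of the names can be grouped together.
However, because of spelling mistakes, some names cannot be grouped.

The expected output should look like this: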

name
A S BITO 
A S KIGEL 
A S NATURENERGIE
A S NATURENERGIE 
A S NATURENERGIE 
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P
A S P BU SERVICE POWER P 
A S P GMBH  
A RESE LAND
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO  
A RITTER SA CO

Below is the code I have tried:

df['name'] = df['name'].replace('A S NATURENERGI', 'A S NATURENERGIE')
df['name'] = df['name'].replace('A S P BU SERVICE POWER PETER GMBH', 'A S P BU SERVICE POWER P')
df['name'] = df['name'].replace('A RITTER WITH SA', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE', 'A RITTER SA CO')
df['name'] = df['name'].replace('A RITTER WITH MASCHINE SA CO ', 'A RITTER SA CO')

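However, this is probably not the best way to handle it.
So I am considering using difflib and computing a match score.
The next step would be to replace each name with the name that has the highest match score.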

import difflib
from functools import partial

# For every name, look up its single closest match among all names in the column
f = partial(difflib.get_close_matches, possibilities=df['name'].tolist(), n=1)
matches = df['name'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio() for x, y in zip(matches, df['name'])]
df_diff = df.assign(best=matches, score=scores)

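The drawback of this approach is that I get exactly the same name back, because every name is also in the list of possibilities and matches itself with a score of 1.0. If anyone has any ideas, please advise!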
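
One way to avoid the self-match is to search only among the other unique names. The following is a rough sketch, not a tested solution for the full data set: `closest_other` is a hypothetical helper name, and the `cutoff=0.6` value is simply the default of `difflib.get_close_matches`.

```python
import difflib
import pandas as pd

def closest_other(name, candidates, cutoff=0.6):
    """Return (best_match, score) among candidates, excluding name itself."""
    others = [c for c in candidates if c != name]
    found = difflib.get_close_matches(name, others, n=1, cutoff=cutoff)
    if not found:
        return None, 0.0
    return found[0], difflib.SequenceMatcher(None, name, found[0]).ratio()

names = pd.Series(["A S NATURENERGI",
                   "A S NATURENERGIE ",
                   "A S NATURENERGIE",
                   "A RESE LAND"]).str.strip()

# Compare unique names only, so duplicates of a name cannot match each other
unique = names.unique().tolist()
best = {n: closest_other(n, unique) for n in unique}
```

Names with no sufficiently similar neighbour map to `(None, 0.0)`, so they can be left unchanged in a later replacement step.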
1 Answer


I wrote a custom function that builds the replacement mapping by iterating over a pandas Series:

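import difflib
import pandas as pd

def similarity_replace(series):

    # Map each name to its space-free form, and keep a reverse lookup
    reverse_map = {}
    diz_map = {}
    for _, s in series.items():  # Series.iteritems() was removed in pandas 2.0
        diz_map[s] = s.replace(" ", "")
        reverse_map[s.replace(" ", "")] = s

    # For each space-free name, keep the shortest of its (up to 3) closest matches
    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6), key=len)[0]

    return series.map(diz_map).map(best_match).map(reverse_map)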

Here is an example:

name = pd.Series(['A S BITO', 
'A S KIGEL',
'A S NATURENERGI',
'A S NATURENERGIE',
'A S NATURENERGIE',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER P',
'A S P BU SERVICE POWER PETER GMBH',
'A S P GMBH',
'A RESE LAND',
'A RITTER WITH SA',
'A RITTER WITH SA', 
'A RITTER WITH SA',
'A RITTER SA CO',
'A RITTER SA CO', 
'A RITTER SA CO',
'A RITTER SA CO',
'A RITTER WITH MASCHINE',
'A RITTER WITH MASCHINE SA CO', 
'A RITTER WITH MASCHINE SA CO'])

similarity_replace(similarity_replace(name))

Output:

0                     A S BITO
1                    A S KIGEL
2              A S NATURENERGI
3              A S NATURENERGI
4              A S NATURENERGI
5     A S P BU SERVICE POWER P
6     A S P BU SERVICE POWER P
7     A S P BU SERVICE POWER P
8                   A S P GMBH
9                  A RESE LAND
10              A RITTER SA CO
11              A RITTER SA CO
12              A RITTER SA CO
13              A RITTER SA CO
14              A RITTER SA CO
15              A RITTER SA CO
16              A RITTER SA CO
17              A RITTER SA CO
18              A RITTER SA CO
19              A RITTER SA CO
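
Note that the function above keeps the shortest close match (`key=len`), which is why rows 2 to 4 become `A S NATURENERGI` rather than the more frequent `A S NATURENERGIE` the question asked for. A possible variant that prefers the most frequent candidate is sketched below. The function name `replace_with_most_frequent` and the `cutoff=0.8` threshold are illustrative assumptions, and because it makes a single pass without chaining matches, it will not necessarily reproduce the question's expected output for the `A RITTER ...` names.

```python
import difflib
import pandas as pd

def replace_with_most_frequent(series, cutoff=0.8):
    """Map each name to the most frequent name among its close matches."""
    cleaned = series.str.strip()          # assumption: whitespace is noise
    counts = cleaned.value_counts()
    unique = counts.index.tolist()
    mapping = {}
    for name in unique:
        # The name itself is always returned (ratio 1.0), so close is never empty
        close = difflib.get_close_matches(name, unique, n=len(unique), cutoff=cutoff)
        # On a tie in counts, max keeps the first, i.e. the most similar candidate
        mapping[name] = max(close, key=lambda c: counts[c])
    return cleaned.map(mapping)

sample = pd.Series(["A S NATURENERGI",
                    "A S NATURENERGIE ",
                    "A S NATURENERGIE",
                    "A S BITO"])
result = replace_with_most_frequent(sample)
print(result.tolist())
```

The threshold controls how aggressively names are merged: a lower `cutoff` groups more variants together but also raises the risk of merging unrelated companies.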
