Matching the elements of a pandas column against a column of another pandas DataFrame

3

I have a pandas DataFrame named A with a column keywords, as follows:

 keywords
 ['loans','mercedez','bugatti','a4']
 ['trump','usa','election','president']
 ['galaxy','7s','canon','macbook']
 ['beiber','spiderman','marvels','ironmen']
 .........................................
 .........................................
 .........................................

I also have another pandas DataFrame, B, with the columns category and words, where words is a comma-separated string, as follows:

category              words
audi                  audi a4,audi a6
bugatti               bugatti veyron, bugatti chiron
mercedez              mercedez s-class, mercedez e-class
dslr                  canon, nikon
apple                 iphone 7s,iphone 6s,iphone 5
finance               sales,loans,sales price
politics              donald trump, election, votes
entertainment         spiderman,captain america, ironmen
music                 justin beiber, rihana,drake
........              ..............
.........             .........

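I want to map the keywords column of DataFrame A against the words column of DataFrame B and assign the corresponding categories. Each keyword should be compared with the individual words inside each string; for example, the keyword a4 should be checked against both words of the string audi a4 in the words column. The expected result is: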

  keywords                                       matched_category
  ['loans','mercedez','bugatti','a4']            ['finance','mercedez','mercedez','bugatti','bugatti','audi']                                    
  ['trump','usa','election','president']         ['politics','politics']                                           
  ['galaxy','7s','canon','macbook']              ['apple','dslr']
  ['beiber','spiderman','marvels','ironmen']     ['music','entertainment','entertainment','entertainment']

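From what I can see, most of your words and keywords overlap to some degree. You should be able to handle that. - Kwright02
@Kwright02 After mapping the keywords, I also want to remove the duplicates. - Learner
Walk over the collection of words with two nested indices; whenever list[i].equals(list[j]), remove one of the two, but make sure j is not the same index as i. - Kwright02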
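
The de-duplication discussed in these comments can be sketched in Python without nested index loops. This is only a minimal illustration, assuming the order of first appearance should be kept; the helper name dedupe_keep_order is hypothetical and does not come from the thread.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so its keys behave
    # like an ordered set of the hashable items passed in.
    return list(dict.fromkeys(items))


matched = ["politics", "politics", "finance", "politics"]
print(dedupe_keep_order(matched))  # ['politics', 'finance']
```

If the order does not matter, sorted(set(items)) is an equally short alternative.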
2 Answers

0
One approach is to use the pandas Series.transform method:
import pandas as pd

A = pd.DataFrame({'keywords': [['loans','mercedez','bugatti','a4'],
                           ['trump','usa','election','president']]})
B = pd.DataFrame({'category': ['audi', 'finance'],
                  'words': ['audi a4,audi a6', 'sales,loans,sales price']})

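def match_category_to_keywords(kws):
    ret = []
    for kw in kws:
        # Boolean mask: True where any comma-separated entry of B['words'] contains kw as a substring
        m = B['words'].transform(lambda words: any(kw in w for w in words.split(',')))
        ret.extend(B['category'].loc[m].tolist())
    # Sorted, de-duplicated categories (pd.np is no longer available in pandas 2.x)
    return sorted(set(ret))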

A['matched_category'] = A['keywords'].transform(lambda kws: match_category_to_keywords(kws))
print(A)

Output:

                            keywords matched_category
0     [loans, mercedez, bugatti, a4]  [audi, finance]
1  [trump, usa, election, president]               []

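This does not match the expected output at all. How would you add multiple categories to DataFrame B? - James
Each entry in the lists represents one row of B. In the example above I only included 2 rows of your data. If you add all the rows, you will get the expected output. - Ghasem Naddaf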
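
To check what this substring-based approach gives on all the sample rows of the question, here is a self-contained sketch. It assumes the rows shown in the question are the whole data set (the elided rows are not reproduced), and the helper name match_categories is hypothetical. It returns sorted, de-duplicated categories, so it differs from the question's expected output, which keeps repeats.

```python
import pandas as pd

# Sample rows copied from the question; elided rows are left out.
A = pd.DataFrame({"keywords": [
    ["loans", "mercedez", "bugatti", "a4"],
    ["trump", "usa", "election", "president"],
    ["galaxy", "7s", "canon", "macbook"],
    ["beiber", "spiderman", "marvels", "ironmen"],
]})
B = pd.DataFrame({
    "category": ["audi", "bugatti", "mercedez", "dslr", "apple",
                 "finance", "politics", "entertainment", "music"],
    "words": [
        "audi a4,audi a6",
        "bugatti veyron, bugatti chiron",
        "mercedez s-class, mercedez e-class",
        "canon, nikon",
        "iphone 7s,iphone 6s,iphone 5",
        "sales,loans,sales price",
        "donald trump, election, votes",
        "spiderman,captain america, ironmen",
        "justin beiber, rihana,drake",
    ],
})


def match_categories(keywords, lookup):
    """Sorted unique categories whose 'words' entries contain any keyword as a substring."""
    matched = set()
    for kw in keywords:
        # True for rows where at least one comma-separated entry contains kw
        mask = lookup["words"].apply(
            lambda words: any(kw in w for w in words.split(","))
        )
        matched.update(lookup.loc[mask, "category"])
    return sorted(matched)


A["matched_category"] = A["keywords"].apply(lambda kws: match_categories(kws, B))
print(A["matched_category"].tolist())
```

Because the test is a plain substring check, a very short keyword such as a single letter would match many rows; exact token matching avoids that.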

0

I believe you can use:

# df1 and df2 correspond to A and B in the question
# build a dictionary by splitting each string on commas and whitespace
d = df2.set_index('category')['words'].str.split(r',\s*|\s+').to_dict()
# flatten the lists into a token -> category dictionary
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'audi': 'audi', 'a4': 'audi', 'a6': 'audi', 'bugatti': 'bugatti', 
 'veyron': 'bugatti', 'chiron': 'bugatti', 'mercedez': 'mercedez', 
 's-class': 'mercedez', 'e-class': 'mercedez', 'canon': 'dslr', 
 'nikon': 'dslr', 'iphone': 'apple', '7s': 'apple', '6s': 'apple',
 '5': 'apple', 'sales': 'finance', 'loans': 'finance', 'price': 'finance', 
 'donald': 'politics', 'trump': 'politics', 'election': 'politics', 
 'votes': 'politics', 'spiderman': 'entertainment', 'captain': 'entertainment',
 'america': 'entertainment', 'ironmen': 'entertainment', 'justin': 'music', 
 'beiber': 'music', 'rihana': 'music', 'drake': 'music'}

# map each keyword to its category in a nested list comprehension, skipping unknown keywords
df1['new'] = [[d1[y] for y in x if y in d1] for x in df1['keywords']]
print (df1)
                                keywords  \
0         [loans, mercedez, bugatti, a4]   
1      [trump, usa, election, president]   
2           [galaxy, 7s, canon, macbook]   
3  [beiber, spiderman, marvels, ironmen]   

                                     new  
0     [finance, mercedez, bugatti, audi]  
1                   [politics, politics]  
2                          [apple, dslr]  
3  [music, entertainment, entertainment]  
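
One caveat of the flat dictionary d1 is that a token listed under several categories keeps only the last category seen. A possible variation, sketched here under that assumption, stores a set of categories per token. The name token_to_categories and the extra "phones" row are hypothetical and were added only to show a shared token; they are not part of the question's data.

```python
import re
from collections import defaultdict

import pandas as pd

# Small illustrative frames; the "phones" row is invented to create a shared token ("7s").
df2 = pd.DataFrame({
    "category": ["audi", "apple", "phones"],
    "words": ["audi a4,audi a6", "iphone 7s,iphone 6s", "galaxy 7s"],
})
df1 = pd.DataFrame({"keywords": [["7s", "a4"], ["usa"]]})

# token -> set of every category in which the token appears
token_to_categories = defaultdict(set)
for category, words in zip(df2["category"], df2["words"]):
    for token in re.split(r",\s*|\s+", words.strip()):
        if token:
            token_to_categories[token].add(category)

# .get avoids creating empty entries for keywords that are not known tokens
df1["new"] = [
    sorted({c for kw in kws for c in token_to_categories.get(kw, ())})
    for kws in df1["keywords"]
]
print(df1["new"].tolist())  # [['apple', 'audi', 'phones'], []]
```

The result is sorted and de-duplicated per row, which also covers the de-duplication requested in the comments on the question.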
