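I have two data frames and want to merge them based on whether the values in the words column of df1 contain the values in the keywords column of df2. I have been trying str.extract, but so far I have not gotten the expected result. Here is an example: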
df1:
[{'id': 1, 'words': 'chellomedia', 'languages': nan},
{'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
{'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
{'id': 4, 'words': 'hello world', 'languages': nan},
{'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
{'id': 6, 'words': 'hello kitty', 'languages': 'English'},
{'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
{'id': 8, 'words': 'hola world', 'languages': nan}]
df2:
[{'code': 1, 'keywords': 'Hello'},
{'code': 2, 'keywords': 'hola'},
{'code': 3, 'keywords': 'world'}]
My attempt:
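import re
import pandas as pd
# Lowercase both columns so that matching is case-insensitive
df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()
pat = '|'.join([re.escape(x) for x in df2.keywords])
# Extract the first matching keyword from each row, then merge on it
df1.insert(0, 'keywords', df1['words'].str.extract('(' + pat + ')', expand=False))
pd.merge(df1, df2, on='keywords', how='left')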
Output:
  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4     hola   5  ¡hola mundo!       Spanish   2.0
5    hello   6   hello kitty       English   1.0
6      NaN   7   ciao mondo!       Italian   NaN
7     hola   8    hola world           NaN   2.0
But the expected result should look like this:
  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0  ---> should be generated in df
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8     hola   8    hola world           NaN   2.0
9    world   8    hola world           NaN   3.0  ---> should be generated in df
How can I produce the expected result? Thanks.
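For reference, here is a minimal sketch of one direction that might produce the table above. It is not a confirmed solution: it assumes that swapping str.extract for str.findall (which collects every non-overlapping match in a row, not just the first) and then calling DataFrame.explode is acceptable, and it rebuilds df1 and df2 from the sample records shown earlier. The name out is only illustrative.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame([
    {'id': 1, 'words': 'chellomedia', 'languages': np.nan},
    {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
    {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
    {'id': 4, 'words': 'hello world', 'languages': np.nan},
    {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
    {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
    {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
    {'id': 8, 'words': 'hola world', 'languages': np.nan},
])
df2 = pd.DataFrame([
    {'code': 1, 'keywords': 'Hello'},
    {'code': 2, 'keywords': 'hola'},
    {'code': 3, 'keywords': 'world'},
])

# Lowercase both sides so that matching is case-insensitive
df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

# One alternation pattern built from all keywords
pat = '|'.join(re.escape(k) for k in df2['keywords'])

out = (
    # findall yields a list of all matches per row (empty list if none)
    df1.assign(keywords=df1['words'].str.findall(pat))
       # explode gives one row per match; empty lists become NaN
       .explode('keywords')
       # left merge keeps the rows that matched no keyword
       .merge(df2, on='keywords', how='left')
)

# Put the columns in the same order as the expected table
out = out[['keywords', 'id', 'words', 'languages', 'code']]
print(out)
```

Two caveats with this sketch: keywords that overlap inside the same word (for example hello and low) would not both be found, because regex alternation only returns non-overlapping matches, and a keyword that occurs twice in one row would produce two rows.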