Check whether each value in a DataFrame contains a word from a column of another DataFrame

Question

3

How can I iterate over each value in one DataFrame and check whether it contains a word from a column of another DataFrame?

a = pd.DataFrame({'text': ['the cat jumped over the hat', 'the pope pulled on the rope', 'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

a    
    text
0   the cat jumped over the hat
1   the pope pulled on the rope
2   i lost my dog in the fog

b
    dirty_words
0   cat
1   dog
2   parakeet

I want a new DataFrame that contains only the matching rows:

result

0   the cat jumped over the hat
1   i lost my dog in the fog

- silverSuns

3 Answers

3

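You can use a list comprehension with any after splitting each string on whitespace. Because this compares whole tokens, it will not match "catheter" just because it contains "cat".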

mask = [any(i in words for i in b['dirty_words'].values)
        for words in a['text'].str.split().values]

print(a[mask])

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
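
The whole-token behaviour described above can be checked with a small variation. This is only a sketch: the "catheter" row and the set-based lookup are assumptions added for illustration, not part of the original answer, and it assumes the text column has no missing values.

```python
import pandas as pd

# Sample data; the middle row is a hypothetical addition to show
# that "catheter" is not treated as a match for "cat".
a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the catheter was replaced',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# Build the lookup set once so each token check is a hash lookup.
dirty = set(b['dirty_words'].dropna())

# Keep a row if at least one of its whitespace-separated tokens is in the set.
mask = [not dirty.isdisjoint(text.split()) for text in a['text']]

result = a[mask]
print(result)
```

Using a set avoids rescanning the word list for every token, which may help when the word list is long.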

- jpp

3

I think you can use isin after str.split:

a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(axis=1)]
Out[380]: 
                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
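
The same split-then-isin idea can also be written without building a wide intermediate DataFrame. The following is a sketch of one possible variant using Series.explode (available in pandas 0.25 and later); it is an alternative formulation, not the answer's original code.

```python
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# One row per token; each token keeps the index of the row it came from.
tokens = a['text'].str.split().explode()

# Flag tokens found in the word list, then collapse back to one flag per row.
mask = tokens.isin(b['dirty_words']).groupby(level=0).any()

result = a[mask]
print(result)
```

Because the tokens stay in a single long Series, rows with very different word counts do not produce a padded table of None values.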

- BENY

- cs95 · Accepted Answer

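Use str.contains with a regular expression.

p = '|'.join(b['dirty_words'].dropna())
# Group the alternation so both \b anchors apply to every word
a[a['text'].str.contains(r'\b(?:{})\b'.format(p))]

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

The word boundaries ensure that "catch" is not matched merely because it contains "cat" (thanks @DSM). The non-capturing group matters here: without it, the pattern reads as \bcat, dog, or parakeet\b, so "catch" would still match through the first alternative.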
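
If the word list might contain regex metacharacters or the text column might contain missing values, a slightly more defensive version could look like the sketch below. The "catch" row, the use of re.escape, and na=False are assumptions added for illustration rather than part of the accepted answer.

```python
import re
import pandas as pd

# The middle row is a hypothetical addition to exercise the word boundary.
a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'try to catch the ball',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# Escape each word so characters such as "." or "+" are matched literally.
alternatives = '|'.join(re.escape(w) for w in b['dirty_words'].dropna())

# The non-capturing group keeps both \b anchors attached to every alternative.
pattern = r'\b(?:{})\b'.format(alternatives)

# na=False treats missing text as "no match" instead of propagating NaN.
mask = a['text'].str.contains(pattern, na=False)

result = a[mask]
print(result)
```

Note that \b only behaves as expected for words that start and end with a word character; entries that begin or end with punctuation would need a different boundary rule.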