pandas DataFrame str.contains() with an AND operation

Question

53

I have a df (pandas DataFrame) with three rows:

some_col_name
"apple is delicious"
"banana is delicious"
"apple and banana both are delicious"

df.col_name.str.contains("apple|banana") catches all of the rows:

"apple is delicious",
"banana is delicious",
"apple and banana both are delicious".

How do I apply an AND operator in the str.contains() method, so that it only grabs strings that contain BOTH "apple" and "banana"?

"apple and banana both are delicious"

I'd like to grab strings that contain 10-20 different words (grape, watermelon, berry, orange, ... etc.).

- aerin

1

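This example is just a toy, since you only have K=2 substrings and they appear in order: apple, banana. What you really want is a way to match K=10-20 substrings in arbitrary order. A regex with multiple lookahead assertions is the way to go (@Anzel's solution). - smci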

10 Answers

46

You can also do it in regex style:

df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]

You can then build the regex string from your list of words, like so:

base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat']  # example
base.format(''.join(expr.format(w) for w in words))

which renders:

'^(?=.*apple)(?=.*banana)(?=.*cat)'

Then you can do your work dynamically.
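The pattern construction above can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name build_all_words_pattern is hypothetical, and the use of re.escape is an added assumption so that words containing regex metacharacters (such as '.' or '+') are matched literally. It also checks that the lookaheads match regardless of the order in which the words appear.

```python
import re

def build_all_words_pattern(words):
    """Return a regex that matches strings containing every word, in any order."""
    # re.escape keeps characters such as '.' or '+' in a word from acting as regex syntax.
    lookaheads = ''.join('(?=.*{})'.format(re.escape(w)) for w in words)
    return '^' + lookaheads

pattern = build_all_words_pattern(['apple', 'banana'])
print(pattern)

# Each lookahead is evaluated from the start of the string, so word order does not matter.
print(bool(re.search(pattern, 'apple and banana both are delicious')))
print(bool(re.search(pattern, 'banana and apple both are delicious')))
print(bool(re.search(pattern, 'apple is delicious')))
```

The resulting string can be passed to str.contains in the same way as the hand-written pattern, for example df['col_name'].str.contains(pattern).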

- Anzel

This is great. I tried doing it with f-strings and ended up with the following; do you have any improvements? filter_string = '^' + ''.join(fr'(?=.*{w})' for w in words) - spen.smith

1

@spen.smith I think your implementation is clear and simple; unless you run into problems, there is no need to improve it further. - Anzel

1

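Anzel's solution is solid. However, while '^(?=.*apple)(?=.*banana)' works, it may need modification if the order in which apple and banana appear is not known. When the order is unknown, an expression like this can be used: '^(?=.*apple)(?=.*banana)|^(?=.*banana)(?=.*apple)'. Also, I would remove the ^ so that it searches anywhere in the string rather than only at the beginning. - seakyourpeak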

37

import pandas as pd

df = pd.DataFrame({'col': ["apple is delicious",
                           "banana is delicious",
                           "apple and banana both are delicious"]})

targets = ['apple', 'banana']

# At least one word from `targets` is present in the sentence.
>>> df.col.apply(lambda sentence: any(word in sentence for word in targets))
0    True
1    True
2    True
Name: col, dtype: bool

# All words from `targets` are present in sentence.
>>> df.col.apply(lambda sentence: all(word in sentence for word in targets))
0    False
1    False
2     True
Name: col, dtype: bool
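One caveat with `word in sentence` is that it tests substrings, so "pineapple" would count as containing "apple". If whole-word matching is wanted, a variation along these lines may help. This is a sketch under assumptions: the helper name contains_all_words is hypothetical, words are taken to be runs of \w characters, and matching is made case-insensitive.

```python
import re

def contains_all_words(sentence, targets):
    """True if every target occurs in the sentence as a whole word (case-insensitive)."""
    # Split the sentence into lowercase word tokens, dropping punctuation.
    tokens = set(re.findall(r'\w+', sentence.lower()))
    return all(word.lower() in tokens for word in targets)

targets = ['apple', 'banana']
print(contains_all_words('apple and banana both are delicious', targets))
print(contains_all_words('pineapple and banana both are delicious', targets))
```

It can be applied to a column in the same way as the lambdas above, for example df.col.apply(lambda s: contains_all_words(s, targets)).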

- Alexander

1

This solution is more flexible than df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')], but its running time compared with the regex approach needs to be evaluated. - seakyourpeak

Very Pythonic indeed! - quest

13

This works:

df.col.str.contains(r'(?=.*apple)(?=.*banana)',regex=True)

- Charan Reddy

6

If you want to use only native methods and avoid writing regexes, here is a vectorized version with no lambdas involved:

import numpy as np

targets = ['apple', 'banana', 'strawberry']
# One boolean mask per target; np.vstack needs a sequence, not a generator.
fruit_masks = [df['col'].str.contains(string) for string in targets]
combined_mask = np.vstack(fruit_masks).all(axis=0)
df[combined_mask]

- Sergey Zakharov

4

Try this regex:

apple.*banana|banana.*apple

The code is:

import pandas as pd

df = pd.DataFrame([[1,"apple is delicious"],[2,"banana is delicious"],[3,"apple and banana both are delicious"]],columns=('ID','String_Col'))

print(df[df['String_Col'].str.contains(r'apple.*banana|banana.*apple')])

Output:

   ID                           String_Col
2   3  apple and banana both are delicious

- pmaniyan

3

If you want to catch at least two of the words in the sentence, maybe this will work (taking the tip from @Alexander):

target=['apple','banana','grapes','orange']
connector_list=['and']
df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (all(connector in sentence for connector in connector_list)))]

Output:

                                   col
2  apple and banana both are delicious

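If you have more than two words to match and they are separated by a comma ',', add the comma to the connector list and change the second condition from all to any: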

df[df.col.apply(lambda sentence: (any(word in sentence for word in target)) & (any(connector in sentence for connector in connector_list)))]

Output:

                                        col
2        apple and banana both are delicious
3  orange,banana and apple all are delicious

- Siraj S.

3

You can create masks:

apple_mask = df.colname.str.contains('apple')
banana_mask = df.colname.str.contains('banana')
df = df[apple_mask & banana_mask]

- Vaibhav Gupta

3

Enumerating all the possibilities for a large list is cumbersome. A better way is to use reduce() and the bitwise AND operator (&).

For example, consider the following DataFrame:

df = pd.DataFrame({'col': ["apple is delicious",
                       "banana is delicious",
                       "apple and banana both are delicious",
                       "i love apple, banana, and strawberry"]})

#                                    col
#0                    apple is delicious
#1                   banana is delicious
#2   apple and banana both are delicious
#3  i love apple, banana, and strawberry

Suppose we want to search for all of the following:

targets = ['apple', 'banana', 'strawberry']

We can do:

from functools import reduce  # required on Python 3, where reduce is no longer a builtin
print(df[reduce(lambda a, b: a&b, (df['col'].str.contains(s) for s in targets))])

#                                    col
#3  i love apple, banana, and strawberry
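The reduce() approach can be packaged as a reusable function. The following is a sketch rather than part of the original answer: the name contains_all is hypothetical, regex=False is an assumption that the targets should be treated as literal substrings, and na=False is an assumption that missing values should count as non-matches (otherwise the mask can contain NaN and fail when used for indexing).

```python
from functools import reduce
import operator

import pandas as pd

def contains_all(series, targets):
    """Boolean mask: True where the string contains every target as a literal substring."""
    # regex=False treats each target literally; na=False maps missing values to False.
    masks = (series.str.contains(t, regex=False, na=False) for t in targets)
    # Combine the per-target masks with element-wise AND.
    return reduce(operator.and_, masks)

df = pd.DataFrame({'col': ["apple is delicious",
                           "apple and banana both are delicious",
                           None]})
mask = contains_all(df['col'], ['apple', 'banana'])
print(df[mask])
```

Note that reduce() without an initial value raises a TypeError on an empty targets list, so callers may want to guard against that case.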

- pault

1

Based on @Anzel's answer, I wrote a function, since I will be applying this often:

def regify(words, base=str(r'^{}'), expr=str('(?=.*{})')):
    return base.format(''.join(expr.format(w) for w in words))

So if you have words defined:

words = ['apple', 'banana']

then call it with something along these lines:

dg = df.loc[
    df['col_name'].str.contains(regify(words), case=False, regex=True)
]

and you should get what you are after.

- Jonny


- flyingmeatball · Accepted Answer

You can do it as follows:

df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]