pandas: how to limit the results of str.contains?

Question


4

I have a DataFrame with more than 1M rows. I want to select all rows where a certain column contains a specific substring:

matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()

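But this selection is slow and I would like to speed it up. Suppose I only need the first n results. Is there a way to stop matching once n results have been found? I tried: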

matching = df['col2'].str.contains('substr', case=True, regex=False).head(n)

and:

matching = df['col2'].str.contains('substr', case=True, regex=False).sample(n)

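But neither is any faster. The second statement (the boolean indexing that builds rows) is very fast. How can I speed up the first statement (the str.contains call)?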

- Julio

2 Answers

1

You can speed it up with:

matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]

However, this solution retrieves the matches found within the first n rows, not the first n matches.

If you really want the first n matches, you should use:

rows =  df['col1'][df['col2'].str.contains("substr")==True].head(n)

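But this option is clearly slower: it still scans the entire column before head(n) is applied, and this call also leaves regex at its default of True.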
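To make the distinction between "matches within the first n rows" and "the first n matches" concrete, here is a minimal sketch on toy data. The six-row frame is hypothetical and not taken from the question; only the column names col1 and col2 follow the original.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the difference.
df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["x", "has substr", "y", "z", "substr too", "substr again"],
})
n = 2

# Matches found within the first n rows (may return fewer than n rows).
head_mask = df["col2"].head(n).str.contains("substr", regex=False)
in_first_n_rows = df["col1"].head(n)[head_mask]

# First n matches across the whole column (every row is scanned first).
full_mask = df["col2"].str.contains("substr", regex=False)
first_n_matches = df["col1"][full_mask].head(n)

print(in_first_n_rows.tolist())   # ['b']
print(first_n_matches.tolist())   # ['b', 'e']
```

The first variant is fast because it only looks at n rows, but it can return fewer than n results, or none at all.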

Inspired by @ScottBoston's answer, you can use the following for a faster complete solution:

rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)

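This approach still evaluates every row, so it is only somewhat faster than running str.contains over the whole column, not dramatically so. With this solution you do get the first n matches.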
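The list comprehension above still visits every row before head(n) is applied. One possible way to actually stop after n matches is to replace the list with a generator and pull from it with itertools.islice. The following is only a sketch under that assumption: the helper name first_n_matches and its parameters are made up for illustration, and it has not been benchmarked against the timings in this answer.

```python
from itertools import islice

import pandas as pd


def first_n_matches(df, substr, n, search_col="col2", value_col="col1"):
    """Return value_col for the first n rows whose search_col contains substr."""
    values = df[search_col].to_numpy()
    # Lazily yield positional indices of matching rows; nothing is scanned
    # until islice pulls from the generator. Non-string values (e.g. NaN)
    # are skipped instead of raising a TypeError.
    hits = (pos for pos, text in enumerate(values)
            if isinstance(text, str) and substr in text)
    positions = list(islice(hits, n))  # stops consuming after n hits
    return df[value_col].iloc[positions]


# Small version of the test data used in this answer.
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
df = pd.DataFrame({"col1": a * 3, "col2": b * 3})

result = first_n_matches(df, "substr", 3)
print(result.tolist())        # ['for', 'this', 'for']
print(result.index.tolist())  # [4, 5, 13]
```

How much this helps depends on the data: if matches are rare or sit near the end of the column, almost every row is still scanned, and the per-row Python loop may end up slower than a vectorized pass.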

The following test code shows the speed and the result of each solution:

import pandas as pd
import time

n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]

col1 = a*1000000
col2 = b*1000000

df = pd.DataFrame({"col1":col1,"col2":col2})

# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))

# Faster option
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast==True]
print("--- %s seconds for fast solution ---" % (time.time() - start_time))


# Other option
start_time = time.time()
rows_other =  df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))

# Complete option
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))

This outputs:

>>> 
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---

And the resulting series are:

>>> rows
4     for
5    this
Name: col1, dtype: object
>>> rows_fast
4     for
5    this
Name: col1, dtype: object
>>> rows_other
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
>>> rows_complete
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object

- Cedric Zoppolo

2

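This doesn't really answer my question. At first I was skeptical about restricting the search space: obviously it improves performance, but at the cost of fewer results. However, after trying your "faster" solution with n=10000, the results were decent and the improvement in time was noticeable. In the end, though, I can't deploy the "faster" solution, because it assumes there will be matches within the first n rows, which may not be true! I will edit my question to clarify this. - Julio

Yes, I figured you wanted the first n matches rather than the matches within the first n rows. I'll look into whether there is any way to improve the time and help you out. Perhaps @ScottBoston's answer is a pretty good solution. - Cedric Zoppolo

Note that your solution also returns the matches within the first n rows. - Cedric Zoppolo

Right. Indeed, your "other" solution returns the first n matches, but it is slower than not using .head() at all, i.e. not limiting the search. - Julio

Please see my update. I think the "complete solution" is a pretty good approach. - Cedric Zoppolo

If you find this answer useful, please consider upvoting and/or accepting it. - Cedric Zoppolo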
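The trade-off raised in the comments above (limiting the search space is fast but may miss matches) can be softened by scanning the frame in chunks and stopping once n matches have been collected. This is a sketch and not something proposed in the thread: the function name first_n_matches_chunked and the chunk_size parameter are assumptions made for illustration.

```python
import pandas as pd


def first_n_matches_chunked(df, substr, n, chunk_size=100_000,
                            search_col="col2", value_col="col1"):
    """Collect the first n matches, scanning chunk_size rows at a time."""
    found = []
    remaining = n
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        # Vectorized substring test on this chunk only; NaN counts as no match.
        mask = chunk[search_col].str.contains(substr, case=True,
                                              regex=False, na=False)
        hits = chunk[value_col][mask.to_numpy()].head(remaining)
        found.append(hits)
        remaining -= len(hits)
        if remaining <= 0:
            break  # enough matches, skip the rest of the frame
    if not found:
        return df[value_col].iloc[0:0]  # empty input frame
    return pd.concat(found)


a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
df = pd.DataFrame({"col1": a * 3, "col2": b * 3})

chunked = first_n_matches_chunked(df, "substr", 3, chunk_size=4)
print(chunked.index.tolist())  # [4, 5, 13]
```

Unlike the "faster" head(n) variant, this keeps scanning until n matches are found or the frame is exhausted, so it does not assume that matches occur near the top.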


- Scott Boston · Accepted Answer

Believe it or not, the .str accessor is slow. You can use a list comprehension for better performance.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col2':np.random.choice(['substring','midstring','nostring','substrate'],100000)})

Test for equality:

all(df['col2'].str.contains('substr', case=True, regex=False) ==
    pd.Series(['substr' in i for i in df['col2']]))

Output:

True

Timings:

%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop

versus

%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
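The %timeit lines above are IPython magics. Outside IPython, roughly the same comparison can be run with the standard timeit module. The sketch below assumes a seeded random generator and two wrapper function names of its own choosing. No timings are shown here because they depend on the machine and on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible
words = ["substring", "midstring", "nostring", "substrate"]
df = pd.DataFrame({"col2": rng.choice(words, 100_000)})


def with_str_accessor():
    return df["col2"].str.contains("substr", case=True, regex=False)


def with_list_comprehension():
    return pd.Series(["substr" in s for s in df["col2"]], index=df.index)


# Check that both approaches produce the same boolean mask.
same = bool((with_str_accessor().to_numpy(dtype=bool)
             == with_list_comprehension().to_numpy(dtype=bool)).all())
print("identical masks:", same)

# Average seconds per call over 10 runs of each approach.
runs = 10
t_accessor = timeit.timeit(with_str_accessor, number=runs) / runs
t_listcomp = timeit.timeit(with_list_comprehension, number=runs) / runs
print(f".str.contains:      {t_accessor:.4f} s per call")
print(f"list comprehension: {t_listcomp:.4f} s per call")
```

Note that the list comprehension raises a TypeError if the column contains missing values, whereas str.contains handles them, so the two are only interchangeable on clean string data.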