Fastest way in Python to find multiple items in a body of text

Question

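I have a long list of short strings that I want to search for in a body of text that is typically about 10,000 characters long. The list holds roughly 500 short strings, and I want to use Python to find every one of them that occurs in the source text.

Here is a short example of my problem: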

cleanText = "four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition that all men are created equal"
searchList = ["years ago","dedicated to","civil war","brought forth"]

My current approach to finding the searchList items that occur in cleanText is:

found = [phrase for phrase in searchList if phrase in cleanText]

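Is this the fastest way to do it in Python? It is not exactly slow, but at scale (500 items in searchList, each checked against a cleanText of 10,000 characters) it seems somewhat slower than I would like.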

- user1521440

Is your content persistent? Could you use a full-text indexing solution? - user2665694

1 Answer

- mgilson · Accepted Answer

You could try using a regular expression. With a large list, this may speed things up:

import re
found = re.findall('|'.join(searchList),cleanText)

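(Of course, this assumes that nothing in searchList needs to be escaped for the purposes of re.)

As pointed out in the comments (thanks anijhaw), you can do the escaping like this: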

found = re.findall('|'.join(re.escape(x) for x in searchList), cleanText)
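To illustrate why the escaping matters, here is a minimal sketch. The text and the phrase "1.5" are made up for this example and do not come from the question; the point is only that an unescaped "." matches any character.

```python
import re

# Hypothetical data, chosen so that one phrase contains a regex metacharacter.
text = "version 125 and version 1.5"
phrases = ["1.5"]

# Unescaped: "." matches any character, so "125" is reported as well.
unescaped = re.findall("|".join(phrases), text)

# Escaped: "." is treated literally, so only the exact phrase matches.
escaped = re.findall("|".join(re.escape(p) for p in phrases), text)

print(unescaped)
print(escaped)
```

If the search phrases come from user input or any uncontrolled source, escaping is probably the safer default.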

If you need to use the regex more than once, you can precompile it with re.compile, for example:

regex = re.compile('|'.join(re.escape(x) for x in searchList))
found = regex.findall(cleanText)
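For reference, a self-contained sketch that runs the compiled pattern on the sample data from the question. The snake_case names are my own. Keep in mind that findall returns one entry per occurrence, in text order, whereas the list comprehension returns each matching phrase once, in list order; the set comparison below is only there to show that both agree on this particular input.

```python
import re

clean_text = ("four score and seven years ago our fathers brought forth on this "
              "continent a new nation conceived in Liberty and dedicated to the "
              "proposition that all men are created equal")
search_list = ["years ago", "dedicated to", "civil war", "brought forth"]

# Compile once; re.escape guards against regex metacharacters in the phrases.
pattern = re.compile("|".join(re.escape(p) for p in search_list))

# One entry per occurrence, in the order the matches appear in the text.
occurrences = pattern.findall(clean_text)

# Distinct phrases found, compared with the original membership test.
found_regex = set(occurrences)
found_plain = {p for p in search_list if p in clean_text}

print(occurrences)
print(found_regex == found_plain)
```

Whether this is actually faster than the list comprehension depends on the data, so it is worth timing both (for example with timeit) on a realistic text and phrase list.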

Note that these solutions only find non-overlapping matches.
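A small sketch of what that limitation looks like, along with one possible workaround that is not part of the original answer: wrapping the alternation in a zero-width lookahead with a capturing group, so that a match is attempted at every position. The text and phrases are invented for illustration.

```python
import re

# Hypothetical data: the two phrases share the word "york".
text = "new york city"
phrases = ["new york", "york city"]
alternation = "|".join(re.escape(p) for p in phrases)

# Plain alternation: scanning resumes after "new york", so "york city" is missed.
non_overlapping = re.compile(alternation).findall(text)

# Lookahead variant: each match consumes no text, so overlaps are reported.
overlapping = re.compile("(?=(" + alternation + "))").findall(text)

print(non_overlapping)
print(overlapping)
```

Even with the lookahead, only one phrase is reported per starting position (the first alternative that matches), so if one phrase is a prefix of another, the order of the alternatives decides which one shows up.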