Faster way to remove stop words in Python

Question

60

I'm trying to remove stop words from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

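I'm processing 6 million such strings, so speed matters. Profiling my code showed that the lines above are the slowest part. Is there a better way to do this? I'm thinking of using something like re.sub from the regex module, but I don't know how to write a pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but it made no difference.

Thank you.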

- mchangun

How big is stopwords.words('english')? - Steve Barnes

@SteveBarnes A list of 127 words. - mchangun

3

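Did you wrap it inside the list comprehension or outside? Try adding stw_set = set(stopwords.words('english')) and using that object instead. - alko

1

@alko I thought I had already wrapped it outside, but I just tried again and my code now runs at least 10x faster!!! - mchangun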
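
The effect described in the two comments above can be sketched without NLTK. Here `load_stop_words` is a hypothetical stand-in for stopwords.words('english') that records each call, and its word list is an assumed sample rather than the real NLTK list. The point is only that a condition inside a comprehension is re-evaluated for every word:

```python
call_log = []  # one entry per call to the hypothetical loader

def load_stop_words():
    # Hypothetical stand-in for stopwords.words('english'); assumed sample words.
    call_log.append(1)
    return ['the', 'a', 'an', 'is', 'of']

text = 'hello bye the the hi'

# set() inside the comprehension: the list is reloaded and the set rebuilt per word.
call_log.clear()
inside = ' '.join([w for w in text.split() if w not in set(load_stop_words())])
calls_inside = len(call_log)

# set() built once, before the comprehension runs.
call_log.clear()
stw_set = set(load_stop_words())
outside = ' '.join([w for w in text.split() if w not in stw_set])
calls_outside = len(call_log)

print(inside, calls_inside)
print(outside, calls_outside)
```

Both variants return the same string, but the first calls the loader once per word in the text, while the second calls it once in total.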

Are you processing the text line by line or all at once? - Leonardo.Z

6 Answers

31

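Sorry for the late reply. This may be useful for new users.

Create a dictionary of stop words using the collections library
Use that dictionary for very fast lookups (time = O(1)) rather than searching a list (time = O(number of stop words))

from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
# Counter is a dict subclass, so membership tests are hash lookups
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])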

- Gulshan Jangid

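This does indeed speed things up significantly, even compared to the regex-based approach. - Diego

1

This is an excellent answer, and I wish it were higher up. When removing words from a text against a list of 20k items, it was unbelievably fast: the Counter approach took only 20 seconds, while the regular approach took over an hour. - mrbTT

Can you explain how 'Counter' speeds up the process? @Gulshan Jangid - Karan Bari

3

The main reason the code above is fast is that we are searching a dictionary, which is basically a hash map, and lookup time in a hash map is O(1). Besides that, Counter is part of the collections library, which is written in C, and since C is faster than Python, Counter is faster than similar code written in Python. - Gulshan Jangid

Just tested it, and this is on average 3x faster than the regex approach. It's a simple and creative solution, and the best one so far. - Julio Cezar Silva

3

Using collections.Counter(stopwords.words('english')) cannot be faster than using set(stopwords.words('english')). I believe the collections.Counter approach just uses more memory unnecessarily. - mikey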
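
For anyone who wants to check the claims in these comments on their own machine, here is a minimal timing sketch. The 127-entry stop list is synthetic (an assumption that mirrors the size mentioned in the question, not the NLTK contents), and the numbers printed will vary between machines:

```python
import timeit
from collections import Counter

# Synthetic stand-in for the stop word list: 126 filler entries plus 'the'.
stop_list = ['w%d' % i for i in range(126)] + ['the']
stop_set = set(stop_list)
stop_counter = Counter(stop_list)

text = 'hello bye the the hi'

def strip_stop_words(container):
    # Same comprehension as in the question; only the container type changes.
    return ' '.join([w for w in text.split() if w not in container])

for name, container in [('list', stop_list), ('set', stop_set), ('Counter', stop_counter)]:
    seconds = timeit.timeit(lambda: strip_stop_words(container), number=20000)
    print(name, round(seconds, 4))
```

All three containers produce the same output. The set and the Counter both use hash lookups, so any difference between those two should be small compared with the gap to the list.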

25

Use a regexp to remove every word that matches a stop word:

import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)

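This is probably much faster than looping yourself, especially for large input strings.

If the last word in the text gets removed this way, you may be left with trailing whitespace. I suggest handling that separately.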
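
A minimal sketch of the same idea with the trailing whitespace handled, assuming a small sample stop list in place of the NLTK one. The re.escape call and the non-capturing group are additions for safety and are not part of the pattern above:

```python
import re

stop_words = ['the', 'a', 'an', 'is', "don't"]  # assumed sample, not the NLTK list

# Escape each word in case the list ever contains regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b\s*')

def remove_stop_words(text):
    # strip() removes the whitespace left behind when the last word is a stop word.
    return pattern.sub('', text).strip()

print(remove_stop_words('hello bye the the hi'))
print(remove_stop_words('say hi to the'))
```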

- Alfe

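Do you have any idea what the complexity of this is? If w is the number of words in the text and s is the number of words in the stop list, I think looping would be on the order of w log s. In this case w is approximately s, so it's w log w. grep would be slower, since it (roughly) has to match character by character. - mchangun

3

Actually, I think the complexities in the O(...) sense are the same. Both are O(w log s), yes. But regexps are implemented at a much lower level and are heavily optimized. The word splitting alone copies everything and creates a list of strings plus the list itself, all of which takes precious time. - Alfe

This approach is much faster than splitting lines, tokenizing words, and then checking each word against a stop word set, especially for larger text inputs. - Bobs Burgers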

7

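First, you are creating the stop word list for every string. Create it once. A set would indeed be a good fit here.

forbidden_words = set(stopwords.words('english'))

Next, get rid of the [] inside join and use a generator instead.

Replace:

' '.join([x for x in ['a', 'b', 'c']])

with:

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning a list. I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.

Lastly, run a job like this (removing stop words from 6 million strings) in parallel. That is a whole different topic.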

- Krzysztof Szularz

1

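I doubt that using a regex would be an improvement; see https://dev59.com/_Gs05IYBdhLWcg3wAdWT#7501659. - alko

Just found it as well. :) - Krzysztof Szularz

1

Thanks. Using a set sped things up by at least 8x. Why does using a generator help? RAM isn't a problem for me because each piece of text is quite small, about 100-200 words. - mchangun

2

Actually, I've found that join performs better with a list comprehension than with the equivalent generator expression. - Janne Karila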
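
The list-versus-generator question is easy to measure. The sketch below assumes a small sample stop set and a text of about 150 words, similar to the sizes mentioned above. In CPython, str.join first turns a non-list argument into a sequence, so the generator form does not avoid building the full sequence:

```python
import timeit

stop_set = {'the', 'a', 'an', 'is'}           # assumed sample stop words
words = 'hello bye the the hi'.split() * 30   # 150 words

def with_list():
    return ' '.join([w for w in words if w not in stop_set])

def with_generator():
    return ' '.join(w for w in words if w not in stop_set)

print('list comprehension  :', timeit.timeit(with_list, number=20000))
print('generator expression:', timeit.timeit(with_generator, number=20000))
```

Both functions return the same string. Timings depend on the machine and Python version, so treat the printed numbers as a local measurement only.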

1

Set difference also seems to work: clean_text = set(text.lower().split()) - set(stopwords.words('english')) - wmik
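
One caveat with the set-difference idea: the result is a set, so word order and repeated words are lost. A small sketch, with an assumed sample stop set standing in for the NLTK list:

```python
stop_set = {'the', 'a', 'an', 'is'}  # assumed sample stop words
text = 'hello bye the the hi hello'

# Set difference: unordered, and each remaining word appears only once.
as_set = set(text.lower().split()) - stop_set

# Filtering the word list: keeps the original order and the repeated 'hello'.
as_list = [w for w in text.lower().split() if w not in stop_set]

print(as_set)
print(as_list)
```

This is fine for a bag-of-words view of the text, but it cannot be joined back into the original sentence.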

2

Try using a regex to remove the stop words and avoid looping:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)

- Anurag Dhadse

0

Using a plain dict seems to be the fastest solution by far.
It even beats the Counter solution by about 10%.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

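Tested with the cProfile profiler.

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

Edit:

On top of that, replacing the list comprehension with a loop gives a further 20% performance gain.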

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # keep a separator so the words do not run together
text = new.rstrip()  # drop the trailing separator

- maxandron

- Andy Rimmer · Accepted Answer

Try caching the stop words object, as shown below. Constructing it on every call to the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        # range replaces xrange, which only exists in Python 2
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler with python -m cProfile -s cumulative test.py. The relevant lines are below.

    nCalls  Cumulative Time
    10000   7.723  words.py:7(testFuncOld)
    10000   0.140  words.py:11(testFuncNew)

So caching the stop words instance gives roughly a 55x speedup (7.723 s versus 0.140 s cumulative).
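
For a large batch of strings, the same caching idea can be wrapped in a small helper that builds the lookup structure once and reuses it. This is only a sketch: `make_stop_word_remover` is a hypothetical name, a frozenset is used instead of the cached list, and the sample words stand in for stopwords.words("english"):

```python
def make_stop_word_remover(stop_words):
    # Built once; every later call reuses the same frozenset.
    stop_set = frozenset(stop_words)

    def remove(text):
        return ' '.join([w for w in text.split() if w not in stop_set])

    return remove

# With NLTK installed, stopwords.words("english") could be passed in here instead.
remove_stop_words = make_stop_word_remover(['the', 'a', 'an', 'is'])

texts = ['hello bye the the hi', 'this is a test']
cleaned = [remove_stop_words(t) for t in texts]
print(cleaned)
```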