Why is pos_tag() so slow, and can this be avoided?

Question

I want to be able to get POS tags one sentence at a time, like this:

import nltk

def __remove_stop_words(self, tokenized_text, stop_words):
    # Tag one tokenized sentence, then drop stop words and stop tags.
    sentences_pos = nltk.pos_tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

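The problem is that pos_tag() takes about one second per sentence. Using pos_tag_sents() would let me process sentences in batches and speed things up, but my life would be much easier if I could work sentence by sentence.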

Is there a faster way to do this?

- Stefan Falk

Which version of nltk are you using? (i.e. nltk.__version__) - unutbu

Background: https://dev59.com/iZHea4cB1Zd3GeqPjwcb#WK8UoYgBc1ULPQZF5eMh - tripleee

1 Answer

- unutbu · Accepted Answer

For nltk version 3.1, pos_tag is defined in nltk/tag/__init__.py as follows:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

So every call to pos_tag first instantiates a PerceptronTagger, which takes some time because it involves loading a pickle file. When tagset is None, _pos_tag simply calls tagger.tag.

You can therefore save some time by loading the file only once and calling tagger.tag yourself instead of calling pos_tag:

from nltk.tag.perceptron import PerceptronTagger

# Load the pickled model once, at import time, instead of on every call.
tagger = PerceptronTagger()

def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    sentences_pos = tagger.tag(tokenized_text)
    filtered_words = [word for (word, pos) in sentences_pos
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

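pos_tag_sents uses the same trick as above: it instantiates PerceptronTagger once and then calls _pos_tag many times. So you will get a comparable performance gain either by using the code above or by refactoring to call pos_tag_sents.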
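To make the difference concrete without depending on NLTK or its model files, here is a minimal sketch of the two call patterns. StandInTagger, tag_rebuilding, get_tagger and tag_reusing are hypothetical names invented for this illustration, and the counter merely stands in for the expensive pickle load; it is not how PerceptronTagger is implemented.

```python
from functools import lru_cache


class StandInTagger:
    """Hypothetical stand-in for a tagger that is expensive to construct."""

    loads = 0  # how many times the (simulated) model has been loaded

    def __init__(self):
        # In a real tagger, this is where the pickle file would be read.
        StandInTagger.loads += 1

    def tag(self, tokens):
        # Toy rule, not a real POS model: label every token as a noun.
        return [(tok, "NN") for tok in tokens]


def tag_rebuilding(tokens):
    # Mirrors the nltk 3.1 pos_tag quoted above: a new tagger on every call.
    return StandInTagger().tag(tokens)


@lru_cache(maxsize=None)
def get_tagger():
    # Constructed on first use, then the same instance is returned.
    return StandInTagger()


def tag_reusing(tokens):
    return get_tagger().tag(tokens)


sentences = [["the", "cat", "sat"], ["dogs", "bark"], ["it", "rains"]]

StandInTagger.loads = 0
slow = [tag_rebuilding(s) for s in sentences]
loads_rebuilding = StandInTagger.loads

StandInTagger.loads = 0
fast = [tag_reusing(s) for s in sentences]
loads_reusing = StandInTagger.loads

print(loads_rebuilding, loads_reusing)  # 3 1
```

Both versions return the same tags; only the number of model loads differs. Building the tagger lazily inside a cached function, rather than at import time as in the answer's code, is a design choice of this sketch: the cost is paid only if tagging is actually used.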

Also, if stop_words is a long list, you can save some time by converting it to a set:

stop_words = set(stop_words)

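because checking membership in a set (e.g. pos not in stop_words) is an O(1), constant-time operation, whereas checking membership in a list is an O(n) operation (i.e. it takes time that grows in proportion to the length of the list).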
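A rough way to see this on your own machine is sketched below. The word list, the tagged sample and the filter_words helper are made up for the illustration, and the absolute timings will vary, so treat the printed numbers as indicative only.

```python
import timeit

# Hypothetical data: a long stop-word list and one tagged sentence.
stop_words_list = [f"word{i}" for i in range(10_000)]
stop_words_set = set(stop_words_list)
tagged = [("word9999", "NN"), ("kitten", "NN"), ("the", "DT")]


def filter_words(tagged_tokens, stop_words):
    # Same comprehension as in the question, lifted out of the class.
    return [word for (word, pos) in tagged_tokens
            if pos not in stop_words and word not in stop_words]


# Both containers give the same result; only the lookup cost differs.
from_list = filter_words(tagged, stop_words_list)
from_set = filter_words(tagged, stop_words_set)

list_time = timeit.timeit(lambda: filter_words(tagged, stop_words_list), number=200)
set_time = timeit.timeit(lambda: filter_words(tagged, stop_words_set), number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

With a list, every tag or word that is not a stop word forces a scan of all 10,000 entries, which is the common case when filtering ordinary text.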