How do I find the most common words using spaCy?

Question

python nlp spacy

26

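I'm using spaCy with Python and it works fine for tagging each word, but I was wondering whether it is possible to find the most common words in a string. Also, is it possible to get the most common nouns, verbs, adverbs and so on?

spaCy includes a count_by function, but I can't get it to run in any meaningful way.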

- Harry Loyd

3 Answers

12

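In Python this should be basically the same as counting anything else. Iterating over a spaCy document yields a sequence of Token objects, and those objects give you access to the annotations.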

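import spacy
from collections import defaultdict, Counter

# Assumes the small English pipeline is installed
# (python -m spacy download en_core_web_sm); the old 'en' shortcut
# is no longer accepted by spaCy 3.
nlp = spacy.load('en_core_web_sm')

# Maps POS id -> Counter of orth ids
pos_counts = defaultdict(Counter)
doc = nlp('My text here.')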

for token in doc:
    pos_counts[token.pos][token.orth] += 1

for pos_id, counts in sorted(pos_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id])

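Note that the .orth and .pos attributes are integers. You can get the strings they map to via the .orth_ and .pos_ attributes. The .orth attribute is the unnormalized view of the token; there are also .lower, .lemma and other string views. You may want to bind a .norm function to do your own string normalization. See the docs for details.

The integers are useful for counting because they can make your counting program much more memory efficient when you are counting over a large corpus. You might also store the frequent counts in a numpy array for additional speed and efficiency. If you don't want to bother with this, feel free to count with the .orth_ attribute directly, or use its alias .text.

Note that the .pos attribute in the code above gives the coarse-grained part-of-speech tag set. The richer treebank tags are available on the .tag attribute.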
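As a rough sketch of the string-based variant mentioned above, the following groups counts by .pos_ and .text directly. To keep it runnable without a model download, it uses a hypothetical Tok namedtuple as a stand-in for spaCy tokens; the helper name count_by_pos is also an assumption of this sketch, not part of spaCy.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical stand-in for a spaCy Token; only the two attributes
# read below are modelled.
Tok = namedtuple("Tok", ["text", "pos_"])


def count_by_pos(tokens):
    """Map each coarse POS string to a Counter of token texts."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[token.pos_][token.text] += 1
    return counts


# Hand-tagged sample so the sketch runs without a model.
sample = [
    Tok("dogs", "NOUN"), Tok("chase", "VERB"), Tok("cats", "NOUN"),
    Tok("and", "CCONJ"), Tok("dogs", "NOUN"), Tok("bark", "VERB"),
]

pos_counts = count_by_pos(sample)
for pos, counts in sorted(pos_counts.items()):
    for text, count in counts.most_common():
        print(pos, count, text)
```

Since the function only reads .text and .pos_, passing a real Doc (for example count_by_pos(nlp('My text here.'))) should work in place of the sample list.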

- syllogism_

Coming here from Google. Is this method preferred over https://github.com/explosion/spaCy/issues/139? - astrojuanlu

6

I'm replying to this thread quite late. However, spaCy does in fact provide a built-in way to do this, using the doc.count_by() function.

import spacy
import spacy.attrs
nlp = spacy.load("en_core_web_sm")
doc = nlp("It all happened between November 2007 and November 2008")

# Returns integers that map to parts of speech
counts_dict = doc.count_by(spacy.attrs.IDS['POS'])

# Print the human readable part of speech tags
for pos, count in counts_dict.items():
    human_readable_tag = doc.vocab[pos].text
    print(human_readable_tag, count)

Output:

VERB 1
ADP 1
CCONJ 1
DET 1
NUM 2
PRON 1
PROPN 2

- kalidurge

- Paras Dahal · Accepted Answer

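I recently had to count the frequency of all the tokens in a text file. You can filter the words by their pos_ attribute to keep only the parts of speech you want. Here is a simple example: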

import spacy
from collections import Counter
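# Assumes the en_core_web_sm pipeline is installed; the old 'en'
# shortcut is no longer accepted by spaCy 3.
nlp = spacy.load('en_core_web_sm')
doc = nlp('Your text here')
# all tokens that aren't stop words or punctuation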
words = [token.text
         for token in doc
         if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
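
To cover the other parts of speech asked about in the question, the same filter can be parameterised. This is only a sketch: top_words is a hypothetical helper, and Tok is a stand-in carrying the four token attributes the function reads, so the example runs without a model. Counting is case-insensitive here, which is a choice made for this sketch rather than something the code above does.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy Token with the attributes used below.
Tok = namedtuple("Tok", ["text", "pos_", "is_stop", "is_punct"])


def top_words(tokens, pos=None, n=5):
    """Return the n most common lowercased words, optionally for one POS."""
    words = [
        token.text.lower()
        for token in tokens
        if not token.is_stop
        and not token.is_punct
        and (pos is None or token.pos_ == pos)
    ]
    return Counter(words).most_common(n)


# Hand-tagged sample so the sketch runs without a model.
sample = [
    Tok("The", "DET", True, False), Tok("cat", "NOUN", False, False),
    Tok("quickly", "ADV", False, False), Tok("chased", "VERB", False, False),
    Tok("the", "DET", True, False), Tok("Cat", "NOUN", False, False),
    Tok(".", "PUNCT", False, True),
]

print(top_words(sample))               # all content words
print(top_words(sample, pos="NOUN"))   # nouns only
print(top_words(sample, pos="ADV"))    # adverbs only
```

With spaCy installed, a call such as top_words(doc, pos="VERB") on the doc from the example above should behave the same way, since only .text, .pos_, .is_stop and .is_punct are read.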