Is there a problem with spaCy's lemmatizer, or can it not lemmatize every word ending in "-ing"?

Question

3

When I run the spaCy lemmatizer, it does not lemmatize the word "consulting", so I suspect it is broken.

Here is my code:

import spacy

nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
lemmatizer = nlp.get_pipe('lemmatizer')
doc = nlp('consulting')
print([token.lemma_ for token in doc])

My output:

['consulting']

- M_Neelakandan

2 Answers

3

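spaCy's lemmatizer behaves differently depending on the part of speech. In particular, for nouns the "-ing" form is already treated as the lemma and is left unchanged.

Here is an example that shows the difference: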

import spacy

nlp = spacy.load("en_core_web_sm")

text = "While consulting, I sometimes tell people about the consulting business."
for tok in nlp(text):
    print(tok, tok.pos_, tok.lemma_, sep="\t")

Output:

While   SCONJ   while
consulting      VERB    consult
,       PUNCT   ,
I       PRON    I
sometimes       ADV     sometimes
tell    VERB    tell
people  NOUN    people
about   ADP     about
the     DET     the
consulting      NOUN    consulting
business        NOUN    business

Notice that the verb gets "consult" as its lemma, while the noun does not.
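To make the part-of-speech dependence concrete, here is a toy sketch of a rule-based lemmatizer in plain Python. It is not spaCy's implementation: the `SUFFIX_RULES` table, the `KNOWN_LEMMAS` vocabulary, and the `toy_lemmatize` function are all hypothetical and deliberately tiny. It only shows why the same string can end up with different lemmas under different tags.

```python
# Toy illustration only: NOT spaCy's actual lemmatizer.
# All names and rule tables below are hypothetical.

# Suffix rewrite rules, keyed by coarse part-of-speech tag.
SUFFIX_RULES = {
    "VERB": [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}

# Tiny stand-in for a dictionary of valid base forms per tag.
KNOWN_LEMMAS = {
    "VERB": {"consult", "tell", "make"},
    "NOUN": {"consulting", "business", "people"},
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return a lemma for `word`, given its part-of-speech tag."""
    word = word.lower()
    vocab = KNOWN_LEMMAS.get(pos, set())
    # A word that is already a known base form for this tag is left alone.
    if word in vocab:
        return word
    # Otherwise try each suffix rule and accept the first known base form.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in vocab:
                return candidate
    # No rule produced a known form: fall back to the word itself.
    return word


print(toy_lemmatize("consulting", "VERB"))  # consult
print(toy_lemmatize("consulting", "NOUN"))  # consulting
print(toy_lemmatize("making", "VERB"))      # make
```

The point of the sketch is that the tag selects which rules and which vocabulary apply, so a lone word that the tagger labels as a noun never reaches the verb rules.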

- polm23

- Kyle F Hartzenberg · Accepted Answer

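spaCy's lemmatizer is not broken; it is behaving as expected. Lemmatization depends on the part-of-speech (PoS) tag assigned to the token, and PoS tagging models are trained on sentences/documents, not on single tokens (words). For example, parts-of-speech.info, which is based on the Stanford PoS tagger, does not let you enter a single word.

In your case, the single word "consulting" is being tagged as a noun, and the spaCy model you are using considers "consulting" the appropriate lemma for that case. If you change the string to "Consulting tomorrow", you will see spaCy lemmatize "consulting" to "consult" because it is tagged as a verb (see the output of the code below). In short, I recommend not trying to lemmatize single tokens; instead, use the model on sentences/documents, as it was designed to be used.

As a side note: make sure you understand the difference between a lemma and a stem. If you are unsure, read the relevant section of Wikipedia's Lemma (morphology) page.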

import spacy
nlp = spacy.load('en_core_web_trf', disable=['parser', 'ner'])
doc = nlp('consulting')
print([[token.pos_, token.lemma_] for token in doc])
# Output: [['NOUN', 'consulting']]
doc_verb = nlp('Consulting tomorrow')
print([[token.pos_, token.lemma_] for token in doc_verb])
# Output: [['VERB', 'consult'], ['NOUN', 'tomorrow']]
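On the lemma-versus-stem distinction mentioned in the side note above, a minimal plain-Python sketch may help. The `toy_stem` and `toy_lemma` functions and the `LEMMA_TABLE` dictionary are hypothetical, hand-written for this illustration; real stemmers and lemmatizers are far more thorough.

```python
# Toy contrast between stemming and lemmatization (illustrative only).
# All names here are hypothetical and not taken from any library.


def toy_stem(word: str) -> str:
    """Crude stemmer: chop a common suffix with no dictionary check."""
    for suffix in ("ing", "es", "ed", "s"):
        # Require a few characters to remain so short words are untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# A lemma is a dictionary form, so it depends on the word AND its tag.
LEMMA_TABLE = {
    ("consulting", "VERB"): "consult",
    ("consulting", "NOUN"): "consulting",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
}


def toy_lemma(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_stem("studies"), toy_lemma("studies", "NOUN"))        # studi study
print(toy_stem("consulting"), toy_lemma("consulting", "NOUN"))  # consult consulting
print(toy_stem("better"), toy_lemma("better", "ADJ"))           # better good
```

A stem is just a truncated string and need not be a real word ("studi"), whereas a lemma is a valid dictionary form chosen with the part of speech in mind, which is why the noun "consulting" can legitimately stay as it is.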

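If you really do need to lemmatize single words, the second approach in the GeeksforGeeks Python lemmatization tutorial produces the lemma "consult". I have created a condensed version of it here for future reference, in case the link goes dead. I have not tested it on other single tokens (words), so it may not work in every case.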

# Condensed version of approach #2 given in the GeeksforGeeks lemmatizer tutorial:
# https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/
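import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Resources used by pos_tag, word_tokenize, and WordNetLemmatizer respectively.
# Recent NLTK releases may ask for differently named resources; if a
# LookupError appears, download the resource it names.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')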


# POS_TAGGER_FUNCTION : TYPE 1
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


lemmatizer = WordNetLemmatizer()
sentence = 'consulting'
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
lemmatized_sentence = []
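for word, tag in pos_tagged:
    wordnet_pos = pos_tagger(tag)
    if wordnet_pos is None:
        # No WordNet equivalent for this tag: keep the word unchanged.
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(lemmatizer.lemmatize(word, wordnet_pos))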
print(lemmatized_sentence)
# Output: ['consult']