Get all possible part-of-speech tags for a single word with NLTK

Question

python nltk part-of-speech

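Some words can have more than one part-of-speech (POS) tag. For example, 'stick' can be either a noun or a verb.

The POS tagger in NLTK guesses the correct tag from the context and returns only that single guess. How can I get a list of all possible tags for any given word?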

- O James

1 Answer

- alvas · Accepted Answer

TL;DR

This is not possible with the default pos_tag function.

In Long

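With the default pos_tag function, this is not possible.

The default pos_tag function is backed by an AveragedPerceptron object, which uses its predict() function to pick the most probable tag: https://github.com/nltk/nltk/blob/develop/nltk/tag/perceptron.py#L48

That function returns the argmax over the list of possible tags: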

def predict(self, features):
    '''Dot-product the features and current weights and return the best label.'''
    scores = defaultdict(float)
    for feat, value in features.items():
        if feat not in self.weights or value == 0:
            continue
        weights = self.weights[feat]
        for label, weight in weights.items():
            scores[label] += value * weight
    # Do a secondary alphabetic sort, for stability
    return max(self.classes, key=lambda label: (scores[label], label))

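If you change the code so that it returns the score of every label in self.classes instead of only the argmax, you effectively get a score for each possible tag given the input features.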
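As a minimal sketch of that change, the function below repeats the dot product from predict() but returns every label with its score, best first. It is a standalone illustration rather than a patch to NLTK: score_all_labels, toy_weights, toy_classes, and the feature strings are hypothetical names and values made up for the example, not the trained model's.

```python
from collections import defaultdict


def score_all_labels(weights, classes, features):
    """Return (label, score) pairs for every class, best first.

    Same dot product as predict(), but every score is kept
    instead of returning only the argmax.
    """
    scores = defaultdict(float)
    for feat, value in features.items():
        if feat not in weights or value == 0:
            continue
        for label, weight in weights[feat].items():
            scores[label] += value * weight
    # Same ordering rule as predict(): score first, then label name.
    return sorted(
        ((label, scores[label]) for label in classes),
        key=lambda pair: (pair[1], pair[0]),
        reverse=True,
    )


# Hypothetical toy weights, NOT taken from the trained NLTK model.
toy_weights = {
    "i word stick": {"NN": 1.5, "VB": 1.0},
    "i-1 tag DT": {"NN": 2.0, "VB": -1.0},
}
toy_classes = {"NN", "VB", "JJ"}

ranked = score_all_labels(
    toy_weights, toy_classes, {"i word stick": 1, "i-1 tag DT": 1}
)
print(ranked)
```

The first entry of the ranked list is the label predict() would have returned. The scores are raw perceptron activations, not probabilities, so they are only meaningful relative to each other. Also note that in the tag() code quoted further down, words found in self.tagdict never reach predict(), so no scores are computed for them at all.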

However, the features used in tag() include the previous two predicted tags: https://github.com/nltk/nltk/blob/develop/nltk/tag/perceptron.py#L156

def tag(self, tokens):
    '''
    Tag tokenized sentences.
    :params tokens: list of word
    :type tokens: list(str)
    '''
    prev, prev2 = self.START
    output = []

    context = self.START + [self.normalize(w) for w in tokens] + self.END
    for i, word in enumerate(tokens):
        tag = self.tagdict.get(word)
        if not tag:
            features = self._get_features(i, word, context, prev, prev2)
            tag = self.model.predict(features)
        output.append((word, tag))
        prev2 = prev
        prev = tag

    return output

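Because each prediction feeds into the features of the following words, returning the n best tags would mean changing the tagger's simple one-best "greedy" approach into something that uses a beam.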
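To make the beam idea concrete, here is a rough sketch of beam search over tag sequences. It does not use NLTK: beam_tag, toy_scores, and TOY_CLASSES are hypothetical, and toy_scores stands in for feature extraction plus the dot product. Summing raw per-word scores to rank sequences is also an assumption of this sketch, since perceptron scores are not probabilities.

```python
def beam_tag(tokens, score_fn, classes, beam_width=3):
    """Return up to beam_width (total_score, tags) hypotheses, best first.

    score_fn(i, tokens, prev, prev2) must return a dict mapping every
    label in classes to a score for the word at position i.
    """
    start = ["-START2-", "-START-"]  # padding tags, oldest first
    beam = [(0.0, [])]
    for i in range(len(tokens)):
        candidates = []
        for total, tags in beam:
            history = start + tags
            prev, prev2 = history[-1], history[-2]
            scores = score_fn(i, tokens, prev, prev2)
            for label in classes:
                candidates.append((total + scores[label], tags + [label]))
        # Highest total first; ties broken by tag sequence for stability.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        beam = candidates[:beam_width]
    return beam


TOY_CLASSES = ["DT", "JJ", "NN", "VB", "VBD"]


def toy_scores(i, tokens, prev, prev2):
    """Hypothetical scorer: a tiny lexicon plus one context rule."""
    lexicon = {
        "the": {"DT": 3.0},
        "stick": {"NN": 1.0, "VB": 1.5},
    }.get(tokens[i].lower(), {})
    scores = {label: lexicon.get(label, -1.0) for label in TOY_CLASSES}
    if prev == "DT":
        scores["NN"] += 1.0  # nouns are likelier right after a determiner
    return scores


results = beam_tag(["the", "stick"], toy_scores, TOY_CLASSES, beam_width=3)
for total, tags in results:
    print(total, tags)
```

With these toy numbers the lexicon alone slightly prefers VB for "stick", but the DT context pushes NN to the top, while the VB reading survives in the beam as the second hypothesis. A greedy tagger would have kept only the first.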