WordNet lemmatization and POS tagging in Python

77

I wanted to use the WordNet lemmatizer in Python, and I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform the above lemmatization accurately?

I did the POS tagging with nltk.pos_tag, but I don't know how to integrate the Treebank POS tags with WordNet-compatible POS tags. Please help.

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)

The output tags I get are NN, JJ, VB, and RB. How do I change these to WordNet-compatible tags?

Also, do I need to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data for evaluation?
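
To illustrate the default-noun behavior described at the top of the question, here is a minimal sketch (the word 'loving' is a made-up example, not from the question):

from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('loving'))       # -> 'loving' (treated as a noun by default)
print(lmtzr.lemmatize('loving', 'v'))  # -> 'love'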


Indeed a very good question! - new QOpenGLWidget
8 Answers

97

First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER:

nltk.tag._POS_TAGGER
>>> 'taggers/maxent_treebank_pos_tagger/english.pickle' 

As it was trained with the Treebank corpus, it also uses the Treebank tag set.
The function below maps Treebank tags to WordNet part-of-speech names:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

You can then use the return value with the lemmatizer:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)
>>> 'go'

Check the return value before passing it to the lemmatizer, because an empty string would give a KeyError.
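
For example, a minimal guard could look like this (a sketch reusing the get_wordnet_pos function above; the token list is a made-up input):

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(['I', 'am', 'loving', 'it']):  # hypothetical tokens
    wn_tag = get_wordnet_pos(tag)
    if wn_tag:
        print(lemmatizer.lemmatize(word, wn_tag))
    else:
        # empty string: omit the pos argument to avoid the KeyError
        print(lemmatizer.lemmatize(word))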


14
Remember to also map the satellite adjectives =) ADJ_SAT = 's'. For details see: http://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html - alvas
2
The POS tag for 'it' in the string 'I'm loving it.' is 'PRP'. The function returns an empty string, which the lemmatizer doesn't accept, and it throws a KeyError. What can be done in that case? - Clock Slave
I would rather use something like ...treebank_tag[0].lower() as the pos-tag input for the lemmatizer. In most cases this covers the conversion, except for adjectives, but that can be solved with a simple if statement. - zwep
1
@ClockSlave: then don't put the empty string into the lemmatizer. - Suzana
6
Which Treebank tags should be mapped to the ADJ_SAT WordNet tag? - Simon Hessner

15

Conversion steps: document -> sentences -> tokens -> POS tags -> lemmas

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# example text
text = 'What can I say about this place. The staff of these restaurants is nice and the eggplant is not bad'

class Splitter(object):
    """
    split the document into sentences and tokenize each sentence
    """
    def __init__(self):
        self.splitter = nltk.data.load('tokenizers/punkt/english.pickle')
        self.tokenizer = nltk.tokenize.TreebankWordTokenizer()

    def split(self,text):
        """
        out : ['What', 'can', 'I', 'say', 'about', 'this', 'place', '.']
        """
        # split the text into individual sentences
        sentences = self.splitter.tokenize(text)
        # tokenize each sentence
        tokens = [self.tokenizer.tokenize(sent) for sent in sentences]
        return tokens


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        return the WordNet POS tag (a, n, r, v) expected by WordNet lemmatization
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # the default pos in lemmatization is noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # find the pos tag for each token: [('What', 'WP'), ('can', 'MD'), ('I', 'PRP') ...
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # lemmatization using the pos tag
        # convert into a feature set of [('What', 'What', ['WP']), ('can', 'can', ['MD']), ...], i.e. [original word, lemmatized word, POS tag]
        pos_tokens = [ [(word, lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag)), [pos_tag]) for (word,pos_tag) in pos] for pos in pos_tokens]
        return pos_tokens

lemmatizer = WordNetLemmatizer()
splitter = Splitter()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 1: split the document into sentences, then tokenize
tokens = splitter.split(text)

#step 2 lemmatization using pos tagger 
lemma_pos_token = lemmatization_using_pos_tagger.pos_tag(tokens)
print(lemma_pos_token)

12

3
More generally: from nltk.corpus import wordnet; print(wordnet._FILEMAP) - mPrinC
Why is ADJ_SAT not represented in POST_LIST? What is an example of an ADJ_SAT adjective? - Simon Hessner
ADJ_SAT belongs to adjective clusters. You can read more about how adjective clusters are arranged here: https://wordnet.princeton.edu/documentation/wngloss7wn - pg2455

9
You can create the map using Python's defaultdict and take advantage of the fact that the lemmatizer's default tag is NOUN:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "Another way of achieving this task"
tokens = word_tokenize(text)
lmtzr = WordNetLemmatizer()

for token, tag in pos_tag(tokens):
    lemma = lmtzr.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)

1
Remember to import wn to make the answer complete: from nltk.corpus import wordnet as wn - pragMATHiC
@pragMATHiC, included it. Thanks. - Shuchita Banthia

6

@Suzana_K's solution works. However, as @Clock Slave mentioned, there are cases that result in a KeyError.

Convert the Treebank tags to WordNet tags:

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None # for easy if-statement 

Now, we only feed the pos into the lemmatize function if we have a WordNet tag:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not supply a tag in case of None
        lemma = lemmatizer.lemmatize(word) 
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag) 
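
To make this runnable end to end, here is a small sketch with a made-up token list (the tokens value and the expected output are illustrative assumptions, not part of the original answer):

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
tokens = ['The', 'dogs', 'are', 'running']  # hypothetical input
lemmas = []
for word, tag in nltk.pos_tag(tokens):
    wntag = get_wordnet_pos(tag)  # defined above; returns None for unmapped tags
    lemmas.append(lemmatizer.lemmatize(word) if wntag is None
                  else lemmatizer.lemmatize(word, pos=wntag))
print(lemmas)  # expected: ['The', 'dog', 'be', 'run']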

2
You can do it as follows:
import nltk
from nltk.corpus import wordnet

wordnet_map = {
    "N": wordnet.NOUN,
    "V": wordnet.VERB,
    "J": wordnet.ADJ,
    "R": wordnet.ADV
}


def pos_tag_wordnet(text):
    """
        Create pos_tag with wordnet format
    """
    pos_tagged_text = nltk.pos_tag(text)

    # map the pos tagging output with wordnet output
    pos_tagged_text = [
        (word, wordnet_map.get(pos_tag[0])) if pos_tag[0] in wordnet_map.keys()
        else (word, wordnet.NOUN)
        for (word, pos_tag) in pos_tagged_text
    ]

    return pos_tagged_text
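
A possible usage sketch (the sample sentence is a made-up input): tokenize, tag with the function above, then pass the mapped tags straight to the lemmatizer:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The cats are sitting on the mats")  # hypothetical sentence
tagged = pos_tag_wordnet(tokens)
print(tagged)   # e.g. [('The', 'n'), ('cats', 'n'), ('are', 'v'), ...]
print([lemmatizer.lemmatize(word, tag) for word, tag in tagged])
# e.g. ['The', 'cat', 'be', 'sit', 'on', 'the', 'mat']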

0

You can do this in one line:

wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v'] else 'n'

Then use wnpos(nltk_pos) to get the POS to pass to .lemmatize(). In your case, lmtzr.lemmatize(word=tagged[0][0], pos=wnpos(tagged[0][1])).
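
For instance, a quick sanity check (the word and tag are made-up inputs):

from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize(word='cooking', pos=wnpos('VBG')))  # -> 'cook'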


0

After searching the Internet, I found this solution: starting from a sentence, split it, POS-tag it, lemmatize it, and clean it (removing punctuation and stop words) to obtain a bag of words. Here is my code:

import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize

punctuation = u",.?!()-_\"\'\\\n\r\t;:+*<>@#§^$%&|/"
stop_words_eng = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
tag_dict = {"J": wn.ADJ,
            "N": wn.NOUN,
            "V": wn.VERB,
            "R": wn.ADV}

def extract_wnpostag_from_postag(tag):
    #take the first letter of the tag
    #the second argument is the default value returned when the key is missing from the dictionary
    return tag_dict.get(tag[0].upper(), None)

def lemmatize_tupla_word_postag(tupla):
    """
    given a tuple of the form (wordString, posTagString), like ('guitar', 'NN'), return the lemmatized word
    """
    tag = extract_wnpostag_from_postag(tupla[1])    
    return lemmatizer.lemmatize(tupla[0], tag) if tag is not None else tupla[0]

def bag_of_words(sentence, stop_words=None):
    if stop_words is None:
        stop_words = stop_words_eng
    original_words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(original_words) #returns a list of tuples: (word, tagString) like ('And', 'CC')
    original_words = None
    lemmatized_words = [ lemmatize_tupla_word_postag(ow) for ow in tagged_words ]
    tagged_words = None
    cleaned_words = [ w for w in lemmatized_words if (w not in punctuation) and (w not in stop_words) ]
    lemmatized_words = None
    return cleaned_words

sentence = "Two electric guitar rocks players, and also a better bass player, are standing off to two sides reading corpora while walking"
print(sentence, "\n\n bag of words:\n", bag_of_words(sentence) )
