How to generate bi/tri-grams using spacy/nltk

Question

12

The input text is always a list of dish names, each consisting of 1~3 adjectives and a noun.

Input:

thai iced tea
spicy fried chicken
sweet chili pork
thai chicken curry

Output:

thai tea, iced tea
spicy chicken, fried chicken
sweet pork, chili pork
thai chicken, chicken curry, thai curry

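Basically, I am looking to parse the sentence tree and generate bigrams by pairing each adjective with the noun.

I would like to achieve this with spacy or nltk.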

- samol

Take a look at http://stackoverflow.com/a/34742540/610569 and https://dev59.com/1GMm5IYBdhLWcg3wivf7. - alvas

3 Answers

5

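You can achieve this in a few steps with NLTK:

PoS-tag the sequences
Generate the desired n-grams (your examples contain no trigrams, only skip-grams, which can be generated from trigrams by dropping the middle token)
Discard all n-grams that don't match the pattern JJ NN.

Example: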

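import nltk  # pos_tag and word_tokenize need the NLTK tagger/tokenizer data installed

def jjnn_pairs(phrase):
    '''
    Iterate over pairs of JJ-NN.
    '''
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    for ngram in ngramise(tagged):
        # Split the (token, tag) pairs into a tuple of tokens and a tuple of tags.
        tokens, tags = zip(*ngram)
        if tags == ('JJ', 'NN'):
            yield tokens

def ngramise(sequence):
    '''
    Iterate over bigrams and 1-skip-bigrams (trigrams with the middle item dropped).
    '''
    for bigram in nltk.ngrams(sequence, 2):
        yield bigram
    for trigram in nltk.ngrams(sequence, 3):
        yield trigram[0], trigram[2]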

Extend the pattern ('JJ', 'NN') and the set of n-grams to suit your needs.
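As one way to read "extend the pattern", here is a minimal sketch that accepts a set of tag patterns and looks at every ordered pair of tokens instead of only bigrams and 1-skip-bigrams. The names `ALLOWED_PATTERNS` and `modifier_noun_pairs` are hypothetical, the pattern set is only an illustration, and the tags in the usage lines are written by hand rather than produced by a tagger.

```python
from itertools import combinations

# Hypothetical set of accepted (modifier tag, head tag) patterns; adjust as needed.
ALLOWED_PATTERNS = {("JJ", "NN"), ("VBD", "NN"), ("NN", "NN")}

def modifier_noun_pairs(tagged, allowed=ALLOWED_PATTERNS):
    """Yield token pairs, in sentence order, whose tags match an allowed pattern.

    `tagged` is a list of (token, tag) tuples, e.g. the output of nltk.pos_tag.
    """
    # combinations() keeps the original order, so it covers bigrams and all skip-grams.
    for (tok_a, tag_a), (tok_b, tag_b) in combinations(tagged, 2):
        if (tag_a, tag_b) in allowed:
            yield tok_a, tok_b

# Hand-written tags, for illustration only.
tagged = [("spicy", "JJ"), ("fried", "VBD"), ("chicken", "NN")]
print([" ".join(pair) for pair in modifier_noun_pairs(tagged)])
# ['spicy chicken', 'fried chicken']
```

Widening the patterns does not by itself remove modifier-modifier pairs: if "chili" is tagged NN, then "sweet chili" still matches JJ NN.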

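I don't think parsing is necessary. The major problem with this approach, however, is that most PoS taggers will probably not tag everything exactly the way you want. For example, the default PoS tagger of my NLTK installation tagged "chili" as NN rather than JJ, and "fried" as VBD. Parsing won't help you with that, though!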
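One possible workaround for such mis-tagging, which is not part of the answer itself, is to override the tagger's output for a small hand-maintained word list before filtering. The sketch below assumes a hypothetical `TAG_OVERRIDES` dictionary and helper `apply_overrides`; the input tags are written by hand to mimic the behaviour described above.

```python
# Hypothetical domain-specific overrides: token (lower-cased) -> tag to force.
TAG_OVERRIDES = {"chili": "JJ", "fried": "JJ", "iced": "JJ"}

def apply_overrides(tagged, overrides=TAG_OVERRIDES):
    """Return a copy of `tagged` with the tag replaced for any listed token."""
    return [(tok, overrides.get(tok.lower(), tag)) for tok, tag in tagged]

# Hand-written tags mimicking the tagger output described in the answer.
tagged = [("sweet", "JJ"), ("chili", "NN"), ("pork", "NN")]
print(apply_overrides(tagged))
# [('sweet', 'JJ'), ('chili', 'JJ'), ('pork', 'NN')]
```

The corrected list can then be filtered for JJ NN pairs as before. The obvious cost is that the override list has to be maintained by hand for the vocabulary at hand.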

- lenz

1

Something like this:

>>> from nltk import bigrams
>>> text = """thai iced tea
... spicy fried chicken
... sweet chili pork
... thai chicken curry"""
>>> lines = map(str.split, text.split('\n'))
>>> for line in lines:
...     ", ".join([" ".join(bi) for bi in bigrams(line)])
... 
'thai iced, iced tea'
'spicy fried, fried chicken'
'sweet chili, chili pork'
'thai chicken, chicken curry'

Or use colibricore: https://proycon.github.io/colibri-core/doc/#installation ;P

- alvas

1

Hey alvas, I specifically want to avoid adjective-adjective pairs. For example, I want to avoid "spicy fried". - samol

- Petr Matuska · Accepted Answer

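I used spaCy 2.0 with the English model to find the nouns and non-nouns in the input, then combined each non-noun with the noun to create the desired output.

Your input: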

s = ["thai iced tea",
"spicy fried chicken",
"sweet chili pork",
"thai chicken curry",]

spaCy solution:

import spacy
nlp = spacy.load('en') # import spacy, load model

def noun_notnoun(phrase):
    doc = nlp(phrase) # create spacy object
    token_not_noun = []
    notnoun_noun_list = []

    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        if item.pos_ == "NOUN":
            noun = item.text

    for notnoun in token_not_noun:
        notnoun_noun_list.append(notnoun + " " + noun)

    return notnoun_noun_list

Calling the function:

for phrase in s:
    print(noun_notnoun(phrase))

Result:

['thai tea', 'iced tea']
['spicy chicken', 'fried chicken']
['sweet pork', 'chili pork']
['thai chicken', 'curry chicken']