Sentence structure identification - spaCy

Question

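I intend to use spaCy and textacy to identify sentence structure in English.

For example: The cat sat on the mat - SVO; The cat jumped and picked up the biscuit - SVVO; The cat ate the biscuit and cookies - SVOO.

The program should read a paragraph and output each sentence as SVO, SVOO, SVVO or another custom structure.

Efforts so far: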

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"] 
VERB = ["ROOT"] 
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)

Output:

(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])

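Question 1: The SVO result is being overwritten. Why?
Question 2: How can I identify a sentence as SVOO, SVO, SVVO, etc.?

Edit 1:

I have been sketching out an approach.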

from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
chk_set = set(['PRP','MD','NN'])
result = chk_set.issubset(t.tag_ for t in doc)
# result is a bool, so only two branches are reachable
if result:
    print("SVO")
else:
    print("SVO not identified")

Edit 2:

Made some further progress.

from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))

Current output:

det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct

Expected output:

SVO SVVO SVOO

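The idea is to reduce the dependency labels to a simple subject-verb-object model.

If there is no other option, I could do this with regular expressions, but that is my last resort.

Edit 3:

After studying this link, I made some improvements.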

def testSVOs():
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)

Current output:

[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]

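Expected output:

I expect a symbolic representation of each sentence. I can extract the SVO triples, but converting them into SVO notation is more a matter of pattern recognition than of the sentence content itself.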

SVO SVO SVOO

- Programmer_nltk

It might help to add your expected output in addition to the actual output. - CAB

Please see edit 2. - Programmer_nltk

1 Answer

- igrinis · Accepted Answer

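Question 1: Why is the SVO result overwritten?

This is a textacy issue. That part of the library does not work very well; see this blog for details.

Question 2: How can a sentence be identified as SVOO, SVO, SVVO, etc.?

You should parse the dependency tree. spaCy provides this information; you only need to write a set of rules to extract it, using the .head, .lefts, .rights and .children attributes.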

>>> for word in text:
...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))

        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN sat 
        sat   VBD       ROOT       VERB sat 
         on    IN       prep        ADP sat 
        the    DT        det        DET mat
        mat    NN       pobj       NOUN on 
          .     .      punct      PUNCT sat 
         of    IN       ROOT        ADP of 
        the    DT        det        DET lab
        art    NN   compound       NOUN lab
        lab    NN       pobj       NOUN of 
          .     .      punct      PUNCT of 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN jumped 
     jumped   VBD       ROOT       VERB jumped 
        and    CC         cc      CCONJ jumped 
     picked   VBD       conj       VERB jumped 
         up    RP        prt       PART picked 
        the    DT        det        DET biscuit
    biscuit    NN       dobj       NOUN picked 
          .     .      punct      PUNCT jumped 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN ate 
        ate   VBD       ROOT       VERB ate 
    biscuit    NN       dobj       NOUN ate 
        and    CC         cc      CCONJ biscuit 
    cookies   NNS       conj       NOUN biscuit 
          .     .      punct      PUNCT ate
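As an illustration of what such a rule set might look like, here is a minimal sketch that is not taken from the answer or from any linked code. It works on plain `(dep, pos, head_index)` rows copied from the table above, so it runs without spaCy. The label sets and the helper names `symbol` and `sentence_pattern` are assumptions made for this example, and the rule that a `conj` token inherits the role of the word it is coordinated with is one possible design choice among several.

```python
# Hypothetical rule set (an assumption for this sketch, not a spaCy API):
# which dependency labels count as subject / object, and which coarse
# POS tags count as verbs.
SUBJ_DEPS = {"nsubj", "nsubjpass"}
OBJ_DEPS = {"dobj", "pobj"}
VERB_POS = {"VERB", "AUX"}


def symbol(rows, i):
    """Return 'S', 'V', 'O' or '' for token i of one sentence.

    rows is a list of (dep, pos, head_index) tuples; head_index is the
    position of the token's head inside the same sentence.
    """
    dep, pos, head = rows[i]
    seen = set()
    # A conjunct is given the role of the word it is coordinated with,
    # so follow the conj chain up to its first member.
    while dep == "conj" and i not in seen:
        seen.add(i)
        i = head
        dep, pos, head = rows[i]
    if dep in SUBJ_DEPS:
        return "S"
    if dep in OBJ_DEPS:
        return "O"
    if dep == "ROOT" and pos in VERB_POS:
        return "V"
    return ""


def sentence_pattern(rows):
    """Concatenate the symbols of all tokens in sentence order."""
    return "".join(symbol(rows, i) for i in range(len(rows)))


# Rows transcribed from the parse shown above.
SENTENCES = [
    # The cat sat on the mat .
    [("det", "DET", 1), ("nsubj", "NOUN", 2), ("ROOT", "VERB", 2),
     ("prep", "ADP", 2), ("det", "DET", 5), ("pobj", "NOUN", 3),
     ("punct", "PUNCT", 2)],
    # The cat jumped and picked up the biscuit .
    [("det", "DET", 1), ("nsubj", "NOUN", 2), ("ROOT", "VERB", 2),
     ("cc", "CCONJ", 2), ("conj", "VERB", 2), ("prt", "PART", 4),
     ("det", "DET", 7), ("dobj", "NOUN", 4), ("punct", "PUNCT", 2)],
    # The cat ate biscuit and cookies .
    [("det", "DET", 1), ("nsubj", "NOUN", 2), ("ROOT", "VERB", 2),
     ("dobj", "NOUN", 2), ("cc", "CCONJ", 3), ("conj", "NOUN", 3),
     ("punct", "PUNCT", 2)],
]

patterns = [sentence_pattern(rows) for rows in SENTENCES]
print(" ".join(patterns))  # SVO SVVO SVOO
```

With a real spaCy `Doc`, the rows for each `sent` in `doc.sents` could presumably be built as `[(t.dep_, t.pos_, t.head.i - sent.start) for t in sent]`. Subordinate clauses, passives and other constructions would need additional rules.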

I suggest you take a look at this code; simply add pobj to its list of OBJECTS and you will get your SVO and SVOO. With a little tweaking you can get SVVO as well.
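Once the triples are available, turning them into the symbolic form asked for in the question could look roughly like the sketch below. It assumes the triples arrive as plain `(subject, verb, object)` tuples in sentence order, as in the output shown in Edit 3; the function name `patterns_from_triples` is made up for this example. Triples that share a subject and verb are treated as one clause with several objects.

```python
def patterns_from_triples(triples):
    """Turn (subject, verb, object) triples into strings such as 'SVOO'.

    Triples sharing the same (subject, verb) pair are grouped into one
    pattern with one 'O' per object. Relies on dicts preserving insertion
    order (Python 3.7+).
    """
    groups = {}
    for subj, verb, obj in triples:
        groups.setdefault((subj, verb), []).append(obj)
    return ["SV" + "O" * len(objs) for objs in groups.values()]


# Triples as printed in Edit 3 of the question.
triples = [
    ("cat", "sat", "mat"),
    ("cat", "jumped", "biscuit"),
    ("cat", "ate", "biscuit"),
    ("cat", "ate", "cookies"),
]
print(" ".join(patterns_from_triples(triples)))  # SVO SVO SVOO
```

Because the grouping key is a pair of strings, two different sentences with the same subject and verb would be merged; keying on token positions instead would avoid that. Patterns with several verbs, such as SVVO, cannot be recovered from triples alone and need the dependency tree.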