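I plan to use spaCy and textacy to identify the structure of English sentences.
For example: "The cat sat on the mat." is SVO, "The cat jumped and picked up the biscuit." is SVVO, and "The cat ate the biscuit and cookies." is SVOO.
The program should read a paragraph and output each sentence as SVO, SVOO, SVVO, or some other custom structure.
My effort so far: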
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"]
VERB = ["ROOT"]
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print("SVO:", text_ext)
Output:
(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
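- Question 1: The SVO output is overwritten (only the triples for the last sentence appear). Why?
- Question 2: How can I determine whether a sentence is SVOO, SVO, SVVO, etc.?
Edit 1:
I have been sketching out some approaches.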
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
chk_set = set(['PRP','MD','NN'])
result = chk_set.issubset(t.tag_ for t in doc)
if result == False:
    print("SVO not identified")
elif result == True: # shouldn't do this
    print("SVO")
else:
    print("Others...")
Edit 2:
Made some further progress.
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))
Current output:
det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct
Expected output:
SVO SVVO SVOO
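The idea is to reduce the dependency labels to a simple subject-verb-object model.
If there is no other option, I would consider using regular expressions for this, but that is my last resort.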
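One way this reduction could be sketched without regular expressions is a plain lookup from dependency label to symbol. The following is only a minimal sketch under stated assumptions: the names `SYMBOLS` and `deps_to_pattern` are hypothetical, the label sets are the ones used in the first snippet, and `conj` is resolved by a crude heuristic (it repeats the most recent symbol) rather than by inspecting the token's actual head.

```python
# Hypothetical mapping from dependency labels to S/V/O symbols,
# using the same label sets as SUBJ, VERB and OBJ in the first snippet.
SYMBOLS = {
    "nsubj": "S", "nsubjpass": "S",
    "ROOT": "V",
    "dobj": "O", "pobj": "O",
}

def deps_to_pattern(dep_labels):
    """Collapse the dependency labels of ONE sentence into a string such as 'SVO'."""
    pattern = []
    for dep in dep_labels:
        if dep in SYMBOLS:
            pattern.append(SYMBOLS[dep])
        elif dep == "conj" and pattern:
            # Assumption: a conjunct plays the same role as the element it is
            # joined to, approximated here by the most recently emitted symbol.
            pattern.append(pattern[-1])
        # All other labels (det, prep, cc, prt, punct, ...) are ignored.
    return "".join(pattern)

# The label sequence printed above, split by sentence by hand.
sentences = [
    "det nsubj ROOT prep det pobj punct",
    "det nsubj ROOT cc conj prt det dobj punct",
    "det nsubj ROOT dobj cc conj punct",
]
patterns = [deps_to_pattern(s.split()) for s in sentences]
print(" ".join(patterns))  # SVO SVVO SVOO
```

With a parsed document, the per-sentence labels could presumably be obtained with something like `[t.dep_ for t in sent]` for each `sent` in `doc.sents`. The heuristic will likely misfire on more complex sentences (subordinate clauses, conjuncts separated from their head), so checking the head of each `conj` token would be more robust.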
Edit 3:
After studying this link, I made some improvements.
def testSVOs():
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)
Current output:
[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
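Expected output:
I am expecting a symbolic notation for each sentence. I can extract the SVO triples, but converting them into SVO notation is more a matter of pattern recognition than of the sentence content itself.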
SVO SVO SVOO
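For what it is worth, the list of triples shown above could be turned into that notation with a simple grouping step. This is only a rough sketch under assumptions: `triples_to_patterns` is a hypothetical helper, the triples are plain tuples of strings in sentence order, and consecutive triples that share a subject and verb are assumed to belong to the same sentence.

```python
from itertools import groupby

def triples_to_patterns(triples):
    """Turn (subject, verb, object) triples into notations such as 'SVO' or 'SVOO'.

    Assumption: consecutive triples with the same subject and verb come from
    one sentence, so each extra triple in a run adds one more 'O'.
    """
    patterns = []
    for _, group in groupby(triples, key=lambda t: (t[0], t[1])):
        n_objects = len(list(group))
        patterns.append("SV" + "O" * n_objects)
    return patterns

# The triples from the current output above.
svos = [
    ("cat", "sat", "mat"),
    ("cat", "jumped", "biscuit"),
    ("cat", "ate", "biscuit"),
    ("cat", "ate", "cookies"),
]
print(" ".join(triples_to_patterns(svos)))  # SVO SVO SVOO
```

This cannot produce SVVO, because each triple pairs exactly one verb with one object, and two identical adjacent sentences would be merged into one pattern. Working from the dependency labels of each sentence is probably the more general route.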