How do I extract the subjects in a sentence and their respective dependent phrases?

I am trying to do subject extraction from sentences, so that I can get the sentiment in accordance with the subject. I am using nltk in Python 2.7 for this purpose. Take the following sentence as an example: Donald Trump is the worst president of USA, but Hillary is better than him. We can see that Donald Trump and Hillary are the two subjects, and that the sentiment related to Donald Trump is negative while the sentiment related to Hillary is positive. So far, I have been able to chunk this sentence into noun phrases, and I get the following:
(S
  (NP Donald/NNP Trump/NNP)
  is/VBZ
  (NP the/DT worst/JJS president/NN)
  in/IN
  (NP USA,/NNP)
  but/CC
  (NP Hillary/NNP)
  is/VBZ
  better/JJR
  than/IN
  (NP him/PRP))
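
For completeness, a minimal chunker along these lines reproduces output of that shape (the grammar below is illustrative, not necessarily exactly what I used):

import nltk

sentence = "Donald Trump is the worst president of USA, but Hillary is better than him"
tokens = nltk.pos_tag(nltk.word_tokenize(sentence))  # needs nltk's punkt + tagger data

# NP = optional determiner, adjectives (JJ/JJR/JJS), then nouns; or a bare pronoun
grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}
      {<PRP>}
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tokens))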

Now, how do I find the subjects from these noun phrases? And how do I group together the phrases that belong to each of the two subjects? Once I have the phrases for the two subjects separately, I can perform sentiment analysis on each of them.

EDIT

I looked into the library (spacy) mentioned by @Krzysiek, and it gave me dependency trees of sentences as well.

Here is the code:
# note: spacy.en is the spaCy 1.x layout; newer versions use spacy.lang.en / spacy.load(...)
from spacy.en import English
parser = English()

example = u"Donald Trump is the worst president of USA, but Hillary is better than him"
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

Here is the dependency tree:

(u'Donald', u'compound', u'Trump', [], [])
(u'Trump', u'nsubj', u'is', [u'Donald'], [])
(u'is', u'ROOT', u'is', [u'Trump'], [u'president', u',', u'but', u'is'])
(u'the', u'det', u'president', [], [])
(u'worst', u'amod', u'president', [], [])
(u'president', u'attr', u'is', [u'the', u'worst'], [u'of'])
(u'of', u'prep', u'president', [], [u'USA'])
(u'USA', u'pobj', u'of', [], [])
(u',', u'punct', u'is', [], [])
(u'but', u'cc', u'is', [], [])
(u'Hillary', u'nsubj', u'is', [], [])
(u'is', u'conj', u'is', [u'Hillary'], [u'better'])
(u'better', u'acomp', u'is', [], [u'than'])
(u'than', u'prep', u'better', [], [u'him'])
(u'him', u'pobj', u'than', [], [])

This gives an in-depth insight into the dependencies of the different tokens of a sentence. Here is a link to the paper which describes the dependencies between the different pairs. How can I use this tree to attach the contextual words to the tokens of the different subjects?
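
One idea I am toying with (a rough sketch, not a complete solution): treat every conjoined verb as its own clause, and collect the words reachable from each verb without crossing into the next clause, so that each nsubj keeps only its own context words:

def clause_tokens(verb):
    # the verb plus its descendants, without descending into a conjoined
    # clause (conj), its coordinator (cc) or punctuation
    out = [verb]
    for child in verb.children:
        if child.dep_ in ("conj", "cc", "punct"):
            continue
        out.extend(clause_tokens(child))
    return out

for token in parsedEx:
    if token.dep_ == "nsubj":
        print(token.orth_, [t.orth_ for t in clause_tokens(token.head)])

On the tree above this keeps "the worst president of USA" with Donald Trump and "better than him" with Hillary.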
2 Answers


I explored the spacy library more, and I finally figured out the solution through dependency management. Thanks to this repo, I figured out how to include adjectives as well in my subject-verb-object (making it SVAO), as well as how to take out compound subjects in the query. Here goes my solution:

from nltk.stem.wordnet import WordNetLemmatizer
from spacy.lang.en import English

SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]
ADJECTIVES = ["acomp", "advcl", "advmod", "amod", "appos", "nn", "nmod", "ccomp", "complm",
              "hmod", "infmod", "xcomp", "rcmod", "poss", "possessive"]
COMPOUNDS = ["compound"]
PREPOSITIONS = ["prep"]

def getSubsFromConjunctions(subs):
    moreSubs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(moreSubs) > 0:
                moreSubs.extend(getSubsFromConjunctions(moreSubs))
    return moreSubs

def getObjsFromConjunctions(objs):
    moreObjs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(moreObjs) > 0:
                moreObjs.extend(getObjsFromConjunctions(moreObjs))
    return moreObjs

def getVerbsFromConjunctions(verbs):
    moreVerbs = []
    for verb in verbs:
        rightDeps = {tok.lower_ for tok in verb.rights}
        if "and" in rightDeps:
            moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
            if len(moreVerbs) > 0:
                moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
    return moreVerbs

def findSubs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        # NOTE: "SUB" is not a label spaCy emits; checking against SUBJECTS
        # makes this fallback branch actually find subjects
        subs = [tok for tok in head.lefts if tok.dep_ in SUBJECTS]
        if len(subs) > 0:
            verbNegated = isNegated(head)
            subs.extend(getSubsFromConjunctions(subs))
            return subs, verbNegated
        elif head.head != head:
            return findSubs(head)
    elif head.pos_ == "NOUN":
        return [head], isNegated(tok)
    return [], False

def isNegated(tok):
    negations = {"no", "not", "n't", "never", "none"}
    for dep in list(tok.lefts) + list(tok.rights):
        if dep.lower_ in negations:
            return True
    return False

def findSVs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs

def getObjsFromPrepositions(deps):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and dep.dep_ == "prep":
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
    return objs

def getAdjectives(toks):
    toks_with_adjectives = []
    for tok in toks:
        adjs = [left for left in tok.lefts if left.dep_ in ADJECTIVES]
        adjs.append(tok)
        # each right child's own label must be checked (tok.dep_ was a bug)
        adjs.extend([right for right in tok.rights if right.dep_ in ADJECTIVES])
        toks_with_adjectives.extend(adjs)

    return toks_with_adjectives

def getObjsFromAttrs(deps):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(getObjsFromPrepositions(rights))
                    if len(objs) > 0:
                        return v, objs
    return None, None

def getObjFromXComp(deps):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(getObjsFromPrepositions(rights))
            if len(objs) > 0:
                return v, objs
    return None, None

def getAllSubs(v):
    verbNegated = isNegated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(getSubsFromConjunctions(subs))
    else:
        foundSubs, verbNegated = findSubs(v)
        subs.extend(foundSubs)
    return subs, verbNegated

def getAllObjs(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    objs.extend(getObjsFromPrepositions(rights))

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def getAllObjsWithAdjectives(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]

    if len(objs)== 0:
        objs = [tok for tok in rights if tok.dep_ in ADJECTIVES]

    objs.extend(getObjsFromPrepositions(rights))

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def findSVOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjs(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
    return svos

def findSVAOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjsWithAdjectives(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    obj_desc_tokens = generate_left_right_adjectives(obj)
                    sub_compound = generate_sub_compound(sub)
                    svos.append((" ".join(tok.lower_ for tok in sub_compound), "!" + v.lower_ if verbNegated or objNegated else v.lower_, " ".join(tok.lower_ for tok in obj_desc_tokens)))
    return svos

def generate_sub_compound(sub):
    sub_compounds = []
    for tok in sub.lefts:
        if tok.dep_ in COMPOUNDS:
            sub_compounds.extend(generate_sub_compound(tok))
    sub_compounds.append(sub)
    for tok in sub.rights:
        if tok.dep_ in COMPOUNDS:
            sub_compounds.extend(generate_sub_compound(tok))
    return sub_compounds

def generate_left_right_adjectives(obj):
    obj_desc_tokens = []
    for tok in obj.lefts:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))
    obj_desc_tokens.append(obj)

    for tok in obj.rights:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))

    return obj_desc_tokens

Now when you pass a query such as:
from spacy.lang.en import English
parser = English()  # NB: a blank pipeline; see the note below if findSVAOs yields []

sentence = u"""
Donald Trump is the worst president of USA, but Hillary is better than him
"""

parse = parser(sentence)
print(findSVAOs(parse))

You will get the following:
[(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]

Thanks @Krzysiek for your solution too; I actually was unable to go deep enough into your library to modify it. I rather tried tweaking the code from the repo linked above to solve my problem.
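
Note: with newer spaCy releases (2.x and later), English() creates a blank pipeline with no tagger or parser, so the snippet above gives an empty list (see the comments below). Loading a full pretrained model instead should work; one possible variant:

import spacy

# any pretrained English pipeline should do; en_core_web_sm is the smallest
nlp = spacy.load("en_core_web_sm")

parse = nlp(u"Donald Trump is the worst president of USA, but Hillary is better than him")
print(findSVAOs(parse))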


Great, spacy is really powerful ;) By the way, I just implemented in my lib the possibility of passing a list of pipe structures, to get more tuples from a single statement ;) I wonder whether your solution works well for very different sentence samples? - Krzysiek
I am getting an empty list as the result. Does it work for you? - Ashok Kumar Jayaraman
I have been trying to run the code either as described or with the model en_core_web_lg specified, but for the same example sentence neither case returns anything. What might I be doing wrong? I am using the latest version of spacy and its models. - Daniel Gannota
I got an empty list [] as well. I added all the changes from the comments above. "en" wasn't loading, so I used: parser = spacy.load('en_core_web_md', disable=['ner','textcat']). Also, I had to remove "parser = English()", and then it worked. - RandomTask
Initially I could only extract one SVAO per sentence, instead of ([(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]). To resolve this, I changed verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"] to verbs = [tok for tok in tokens if tok.pos_ == "VERB" or tok.dep_ != "aux"] - Michael Lilley


I was solving a very similar problem recently - I needed to extract subject(s), action(s) and object(s). I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner

It is based on spacy (as opposed to nltk), and it also builds on the sentence trees.

So, for instance, let's take this doc embedded in spacy as an example:

import spacy
nlp = spacy.load("en")  # on newer spaCy versions: spacy.load("en_core_web_sm")
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
               "Pacific and was already at war with the Republic of China " \
               "in 1937, but the world war is generally said to have begun on " \
               "1 September 1939 with the invasion of Poland by Germany and " \
               "subsequent declarations of war on Germany by France and the United Kingdom. " \
               "From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
               "or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
               "Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
               "annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
               "The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
               "and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
               "the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
               "long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
               "of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
               "of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
               "the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
               "the Western Pacific.")

Now you can create a simple pipes structure (more about pipes in the readme of this project):

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   FindTokensPipe("VERB"),
                   AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("GPE"), 
                                                NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()]),
                            SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("LOC"), 
                                                NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()])])]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()

And in the result you will get:

>>>[([Germany], [conquered], [Europe]),
 ([Japan], [attacked], [the, United, States])]

For now it relies strongly on another library for finding the pipes - grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
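
To give a feel for those patterns, a small illustration of querying a sentence tree with grammaregex (the function names follow its README; treat the exact API as an assumption):

from grammaregex import match_tree, find_tokens

sent = list(doc.sents)[0]
# does this sentence's tree match the pattern path at all?
print(match_tree(sent, "VERB/nsubj/*"))
# tokens sitting at the end of the pattern path
print(find_tokens(sent, "VERB/nsubj/NNP"))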

EDITED

Actually, the example I presented in the readme discards adjectives, but all you need is to adjust the pipe structure passed to the engine according to your needs. For instance, for your sample sentence I can propose the following structure/solution, which gives you a tuple of 3 elements (subject, verb, adjective) per sentence:

import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                       AggregatePipe([FindTokensPipe("VERB"),
                                      FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                                      FindTokensPipe("VERB/xcomp/VERB")]),
                       AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                                AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                               FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                      ]

# doc here should wrap the question's sentence, e.g.:
# doc = nlp(u"Donald Trump is the worst president of USA, but Hillary is better than him")
engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()

It will give you the result:

[([Donald, Trump], [is], [the, worst])]

What is slightly complicated is that you have a compound sentence, and this lib produces one tuple per sentence - but I will soon add the possibility (I need it for my project as well) of passing a list of pipe structures to the engine, to allow producing more tuples per sentence. For now you can solve it by creating a second engine for compound sentences, whose structure differs only in having VERB/conj/VERB instead of VERB (those regex-like patterns always start from ROOT, so VERB/conj/VERB takes you to the second verb in a compound sentence):

pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                                  FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                                  FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
                   AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                            AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                           FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                  ]

engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])

Now, after you run both engines, you will get the expected result :)

engine.process()
engine2.process()
[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]

I think this is what you need. Of course, I just quickly created a pipe structure for the given example sentence, and it won't work for every case; but I have seen a lot of sentence structures and it already covers quite a nice percentage. For the cases it can't handle yet, you can add more FindTokensPipe etc., and I am sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complex, so... :)


I tried your solution, but it ignores the adjectives/adverbs attached to the different subjects. Since I need to perform sentiment analysis, this makes all the statements/relations neutral. - psr
@Krzysiek After entering engine2.process(), I get the error: AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'next_sent' - Jaffer Wilson
@JafferWilson Thanks. It should be Context, which is a kind of wrapper over the spacy doc. I have corrected it. - Krzysiek
When I try your code with the example you provided, the output is []. - Jaffer Wilson
@JafferWilson I just tried it and it works fine. What may be a bit confusing is that the part before "EDITED" works for my example ("The Empire of Japan..."), while the code after "EDITED" is adjusted to the question author's needs and his statement: "Donald Trump is the worst president of USA, but Hillary is better than him". - Krzysiek
