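I recently solved a very similar problem: I needed to extract subjects, actions and objects. I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner
It is based on spaCy (as opposed to nltk), but it also works on the sentence tree.
So, for instance, let's take this document parsed with spaCy as an example: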
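import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *
# On spaCy 3+ the "en" shortcut no longer exists; load a named model such as "en_core_web_sm" instead.
nlp = spacy.load("en")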
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
"Pacific and was already at war with the Republic of China " \
"in 1937, but the world war is generally said to have begun on " \
"1 September 1939 with the invasion of Poland by Germany and " \
"subsequent declarations of war on Germany by France and the United Kingdom. " \
"From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
"or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
"Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
"annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
"The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
"and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
"the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
"long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
"of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
"of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
"the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
"the Western Pacific.")
Now you can create a simple pipe structure (for more about pipes, see the readme of the project):
pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   FindTokensPipe("VERB"),
                   AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("GPE"),
                                                         NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()]),
                            SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("LOC"),
                                                         NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()])])]
engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
And as a result you will get:
>>>[([Germany], [conquered], [Europe]),
([Japan], [attacked], [the, United, States])]
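The result is a list of tuples whose elements are lists of tokens. If plain strings are easier to work with downstream, a small helper along these lines could flatten them. This is only a sketch: `tuples_to_text` is a hypothetical name, not part of textpipeliner, and it assumes each element is a list of spaCy tokens (which expose `.text`) or plain strings.

```python
def tuples_to_text(results):
    """Join each list of tokens in every result tuple into one string."""
    def token_text(tok):
        # spaCy tokens expose .text; fall back to str() for plain strings.
        return getattr(tok, "text", str(tok))

    return [tuple(" ".join(token_text(t) for t in part) for part in row)
            for row in results]


# Stand-in data shaped like the engine output above (strings instead of tokens).
sample = [(["Germany"], ["conquered"], ["Europe"]),
          (["Japan"], ["attacked"], ["the", "United", "States"])]
print(tuples_to_text(sample))
```

With a real engine this would be called as `tuples_to_text(engine.process())`, assuming `process()` returns the list shown above.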
Actually, it (the token-finding pipes in particular) is strongly based on another library, grammaregex. You can read more about it in this post:
https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
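To give a feel for what a path-like pattern such as "VERB/nsubj/*" means, here is a toy matcher over a hand-built tree. It is not grammaregex's implementation, only a simplified illustration of the idea: `Node`, `find_tokens` and `_accepts` are hypothetical names, the tree is built by hand instead of by spaCy, a single `pos` field stands in for both coarse and fine tags, and only `*` and `[a,b]` are supported (no `**`).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a parsed token: text, POS tag, dependency label, children."""
    text: str
    pos: str
    dep: str = "ROOT"
    children: list = field(default_factory=list)


def _accepts(spec, value):
    # "*" matches anything; "[a,b]" matches any listed alternative.
    if spec == "*":
        return True
    if spec.startswith("[") and spec.endswith("]"):
        return value in [s.strip() for s in spec[1:-1].split(",")]
    return spec == value


def find_tokens(root, pattern):
    """Return nodes reached by following a POS/dep/POS/... path from the root."""
    parts = pattern.split("/")
    if not _accepts(parts[0], root.pos):
        return []
    current = [root]
    # After the root POS, the pattern comes in (dependency, POS) pairs.
    for dep_spec, pos_spec in zip(parts[1::2], parts[2::2]):
        current = [child
                   for node in current
                   for child in node.children
                   if _accepts(dep_spec, child.dep) and _accepts(pos_spec, child.pos)]
    return current


# Hand-built tree for "Germany conquered Europe" (no parser involved).
root = Node("conquered", "VERB", children=[
    Node("Germany", "PROPN", "nsubj"),
    Node("Europe", "PROPN", "dobj"),
])
print([n.text for n in find_tokens(root, "VERB/nsubj/*")])
print([n.text for n in find_tokens(root, "VERB/[dobj,attr]/PROPN")])
```

Because matching always starts at the root, a longer path simply walks further down the tree one (dependency, POS) step at a time.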
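EDIT
Actually, the example I presented in the readme discards adjectives, but all you need to do is adjust the pipe structure passed to the engine according to your needs.
For instance, for your sample sentence I can propose a structure like this, which gives you a tuple of 3 elements (subject, verb, adjective) per sentence: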
import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *
pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   AggregatePipe([FindTokensPipe("VERB"),
                                  FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                                  FindTokensPipe("VERB/xcomp/VERB")]),
                   AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                            AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                           FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                   ]
engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
It will give you the result:
[([Donald, Trump], [is], [the, worst])]
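A slight complication is that you have a compound sentence, and the library produces one tuple per sentence. I will soon add the ability (I need it for my own project too) to pass a list of pipe structures to the engine, so that it can produce more than one tuple per sentence. For now, you can work around it by creating a second engine for compound sentences whose structure differs only in using VERB/conj/VERB instead of VERB (these patterns always start from ROOT, so VERB/conj/VERB takes you to the second verb of the compound sentence):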
pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                                      NamedEntityFilterPipe(),
                                      NamedEntityExtractorPipe()]),
                        AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                                       FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
                        AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                                 AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                                FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                        ]
engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])
Now, after you run both engines, you will get the expected result :)
engine.process()
engine2.process()
[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]
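If a single list is more convenient than two separate outputs, the engines can be run in a loop and their results concatenated. A rough sketch, assuming `process()` returns a list of tuples as shown above; `run_engines` and `FakeEngine` are hypothetical names used only for illustration, and `FakeEngine` merely stands in for `PipelineEngine` so the snippet runs without spaCy.

```python
def run_engines(engines):
    """Call process() on each engine and concatenate the resulting tuples."""
    combined = []
    for engine in engines:
        combined.extend(engine.process())
    return combined


class FakeEngine:
    """Stand-in for PipelineEngine so the sketch runs without spaCy."""
    def __init__(self, output):
        self._output = output

    def process(self):
        return list(self._output)


main = FakeEngine([(["Donald", "Trump"], ["is"], ["the", "worst"])])
conj = FakeEngine([(["Hillary"], ["is"], ["better"])])
print(run_engines([main, conj]))
```

With the real objects from this answer, the call would be `run_engines([engine, engine2])`.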
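I think this is what you need. Of course, I put this pipe structure together quickly for the given example sentence, so it will not work for every case. Still, from the many sentence structures I have seen, it already covers a fairly large percentage. For cases it does not yet handle, you can add more FindTokensPipe and similar pipes, and I am sure that after a few adjustments you will cover a good number of possible sentences (English is not too complex, so... :)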
[(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]
). To fix this, I changed verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
to verbs = [tok for tok in tokens if tok.pos_ == "VERB" or tok.dep_ != "aux"]
. - Michael Lilley