Extracting relations with NLTK

Question

10

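This is a follow-up to my earlier question. I am using nltk to parse persons, organizations, and the relations between them. Using this example, I was able to create chunks of persons and organizations; however, the nltk.sem.extract_rels call gives me an error: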

AttributeError: 'Tree' object has no attribute 'text'

Here is the complete code:

import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc,corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)

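This is very similar to the example given in the book, but the book's example uses prepared "parsed docs", which appear out of nowhere, and I don't know where to find their object type. I have also searched the git repository. Any help is appreciated.

My ultimate goal is to extract persons, organizations, and titles (with dates) for some companies, and then build a network graph of persons and organizations.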

- karlos

Did you ever solve this? Could I see your solution? I am running into exactly the same problem. - user3314418

3 Answers

5

Here is the source code of the nltk.sem.extract_rels function:

def extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10):
    """
    Filter the output of ``semi_rel2reldict`` according to specified NE classes and a filler pattern.

    The parameters ``subjclass`` and ``objclass`` can be used to restrict the
    Named Entities to particular types (any of 'LOCATION', 'ORGANIZATION',
    'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE').

    :param subjclass: the class of the subject Named Entity.
    :type subjclass: str
    :param objclass: the class of the object Named Entity.
    :type objclass: str
    :param doc: input document
    :type doc: ieer document or a list of chunk trees
    :param corpus: name of the corpus to take as input; possible values are
        'ieer' and 'conll2002'
    :type corpus: str
    :param pattern: a regular expression for filtering the fillers of
        retrieved triples.
    :type pattern: SRE_Pattern
    :param window: filters out fillers which exceed this threshold
    :type window: int
    :return: see ``mk_reldicts``
    :rtype: list(defaultdict)
    """
    ....

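If you pass the corpus parameter as ieer, nltk.sem.extract_rels expects the doc parameter to be an IEERDocument object. You should pass corpus as ace, or not pass it at all (it defaults to ace). In that case it expects chunk trees, which is what you want. I modified the code as follows: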

import nltk
import re
from nltk.sem import extract_rels,rtuple

#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read().decode('utf-8')

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

# here i changed reg ex and below i exchanged subj and obj classes' places
OF = re.compile(r'.*\bof\b.*')

for i, sent in enumerate(tagged_sentences):
    sent = nltk.ne_chunk(sent) # ne_chunk method expects one tagged sentence
    rels = extract_rels('PER', 'ORG', sent, corpus='ace', pattern=OF, window=7) # extract_rels method expects one chunked sentence
    for rel in rels:
        print('{0:<5}{1}'.format(i, rtuple(rel)))

It gives this result:

[PER: u'Chairman/NNP'] u'and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT' [ORG: u'Company/NNP']
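
The filler in that output still carries POS tags, because the chunk trees are built from pos_tag output and hold (word, tag) pairs. Here is a minimal sketch, using only the standard re module and the filler string copied from the output above, of why the OF pattern accepts it while the pattern from the question would not. The token-count check is an assumed reading of the window parameter based on its docstring, and the two "served as" strings are made up for illustration.

```python
import re

# Filler copied from the output above: the tokens between the two entities, still POS-tagged.
filler = "and/CC Chief/NNP Executive/NNP Officer/NNP of/IN the/DT"

OF = re.compile(r'.*\bof\b.*')   # pattern from this answer
AS = re.compile(r'.+\s+as\s+')   # pattern from the question

# "/" is a non-word character, so \b still finds the word "of" inside "of/IN".
of_matches = OF.match(filler) is not None

# The question's pattern needs whitespace right after "as", but a tagged token reads "as/IN".
as_matches_tagged = AS.match("served/VBD as/IN chairman/NN") is not None
as_matches_plain = AS.match("served as chairman") is not None

# Assumption: window caps the number of whitespace-separated tokens in the filler.
within_window = len(filler.split()) <= 7

print(of_matches, as_matches_tagged, as_matches_plain, within_window)
```

If a pattern written for plain text returns nothing, printing the fillers and checking whether they contain word/TAG tokens is a reasonable first step.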

- cuneyttyler

1

When I copy and paste this sample code I get nothing. Is the regular expression correct? When I run it, it does not give me your output. - john doe

1

I ran it again and got the same result. I believe the regular expression is correct. I really don't know what might be going wrong. - cuneyttyler

1

The only thing I changed was removing .decode(), because I am using Python 3. Do you think that has anything to do with it? - john doe

1

Yes, you should run this code in Python 2, because the NLTK versions for Python 2 and Python 3 also differ. - cuneyttyler

1

I ran it with Python 2, but the result is still the same... I don't get the same output; in fact I get nothing at all... My nltk version is 3.2.1. Any suggestions?... - john doe

0

This is an nltk version issue. Your code should work in nltk 2.x, but for nltk 3 you should write it like this:

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.relextract.rtuple(rel))
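
The negative lookahead in that pattern is easy to misread, so here is a small check that uses only the standard re module. The sample filler strings are made up for illustration and are not taken from the ieer corpus, and applying the pattern with match() from the start of the filler is an assumption about how the filter is used.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Illustrative filler strings (assumptions, not corpus data).
fillers = [
    "in",                                        # bare preposition
    ", based in",                                # "in" at the end of a longer filler
    "success in supervising the transition of",  # "in" followed later by an -ing word
    "within",                                    # "in" only as part of another word
]

# Keep the fillers that the pattern matches from the start of the string.
accepted = [f for f in fillers if IN.match(f)]
print(accepted)
```

The lookahead rejects any filler in which the standalone word "in" is followed, anywhere later in the string, by text containing "ing".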

NLTK relation extraction example does not work

- Mitu Vinci

1

Downvoted because it lacks an explanation and links to a question with a different error context. - padmalcom


- bdk · Accepted Answer

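It looks like, for an object to be a "parsed doc", it needs a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works: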

import nltk
import re

IN = re.compile (r'.*\bin\b(?!\b.+ing)')

class doc():
  pass

doc.headline=['foo']
doc.text=[nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION',['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in  nltk.sem.extract_rels('ORG','LOC',doc,corpus='ieer',pattern=IN):
   print nltk.sem.relextract.show_raw_rtuple(rel)

Running this code prints the following:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']

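Obviously you would not really write code like this, but it provides a working example of the data format that extract_rels expects. You just need to work out the preprocessing steps that turn your data into that format.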
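
One possible preprocessing step, sketched under assumptions: concatenate the children of each chunked sentence into one flat token list and attach it to an object with text and headline attributes. The helper name to_parsed_doc is hypothetical, and plain nested lists stand in for nltk.Tree objects so that the sketch runs without NLTK. Whether extract_rels accepts the result unchanged may depend on the NLTK version, so treat this as a starting point.

```python
from types import SimpleNamespace


def to_parsed_doc(chunked_sentences, headline=()):
    """Hypothetical helper: build an object with the two attributes described above."""
    text = []
    for sentence in chunked_sentences:
        # Iterating a chunked sentence yields its children: plain tokens and entity subtrees.
        text.extend(sentence)
    return SimpleNamespace(text=text, headline=list(headline))


# Stand-ins for chunk trees; with NLTK these would be the trees returned by nltk.ne_chunk,
# and each inner list would be an entity subtree such as nltk.Tree('LOCATION', [...]).
sentences = [
    [["ORGANIZATION", "WHYY"], "in", ["LOCATION", "Philadelphia"], "."],
    ["Ms.", ["PERSON", "Gross"], ","],
]

doc = to_parsed_doc(sentences, headline=["foo"])
print(doc.headline)
print(doc.text)
```

Using an instance rather than setting attributes on a bare class keeps each document's tokens separate when several documents are processed in a loop.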