IOB/BIO意味着Inside, Outside, Beginning(IOB),有时也称为Beginning, Inside, Outside(BIO)
斯坦福NE标记器以IOB/BIO风格返回标记,例如:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
('Rami', 'PERSON'), ('Eid', 'PERSON')
被标记为 PERSON,"Rami" 是 NE 块的开头,而 "Eid" 是内部。然后你会看到任何非-NE 都会被标记为 "O"。
提取连续的 NE 块的想法与使用正则表达式进行命名实体识别:NLTK非常相似,但由于 Stanford NE 块分析器 API 不返回易于解析的树形结构,因此您需要执行以下操作:
def get_continuous_chunks(tagged_sent):
continuous_chunk = []
current_chunk = []
for token, tag in tagged_sent:
if tag != "O":
current_chunk.append((token, tag))
else:
if current_chunk:
continuous_chunk.append(current_chunk)
current_chunk = []
if current_chunk:
continuous_chunk.append(current_chunk)
return continuous_chunk
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]
print named_entities
print
print named_entities_str
print
print named_entities_str_tag
print
[输出]:
[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]
['Rami Eid', 'Stony Brook University', 'NY']
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
请注意以下限制:如果两个命名实体是连续的,那么可能会出错,尽管我仍然想不出任何在它们之间没有任何“O”的情况下两个命名实体是连续的例子。
正如@alexis所建议的那样,最好将Stanford NE输出转换为NLTK树:
from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree
def stanfordNE2BIO(tagged_sent):
bio_tagged_sent = []
prev_tag = "O"
for token, tag in tagged_sent:
if tag == "O":
bio_tagged_sent.append((token, tag))
prev_tag = tag
continue
if tag != "O" and prev_tag == "O":
bio_tagged_sent.append((token, "B-"+tag))
prev_tag = tag
elif prev_tag != "O" and prev_tag == tag:
bio_tagged_sent.append((token, "I-"+tag))
prev_tag = tag
elif prev_tag != "O" and prev_tag != tag:
bio_tagged_sent.append((token, "B-"+tag))
prev_tag = tag
return bio_tagged_sent
def stanfordNE2tree(ne_tagged_sent):
bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]
sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
ne_tree = conlltags2tree(sent_conlltags)
return ne_tree
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
('in', 'O'), ('NY', 'LOCATION')]
ne_tree = stanfordNE2tree(ne_tagged_sent)
print ne_tree
[out]:
(S
(PERSON Rami/NNP Eid/NNP)
is/VBZ
studying/VBG
at/IN
(ORGANIZATION Stony/NNP Brook/NNP University/NNP)
in/IN
(LOCATION NY/NNP))
然后:
ne_in_sent = []
for subtree in ne_tree:
if type(subtree) == Tree:
ne_label = subtree.label()
ne_string = " ".join([token for token, pos in subtree.leaves()])
ne_in_sent.append((ne_string, ne_label))
print ne_in_sent
[输出]:
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]