How to find proper nouns using spaCy NLP

3
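I'm building a keyword extractor with spaCy. The keyword I'm looking for is OpTic Gaming in the following text:
"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"
How can I parse OpTic Gaming out of this text? If I use noun chunks, I get OpTic Gaming's main sponsors, and if I use tokens, I get ["OpTic", "Gaming", "'s"].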
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

The company was OpTic Gaming's main sponsor, supporting them in their first Call of Duty Championship.

2 Answers

6

spaCy tags each token with its part of speech (proper noun, determiner, verb, and so on), which you can read from token.pos_.

In your case:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for tok in doc:
    print(tok, tok.pos_)

...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...

You can then filter for proper nouns, group the consecutive ones, and slice the doc to get the proper-noun groups as spans:

def extract_proper_nouns(doc):
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []
    current = []
    for elt in pos:
        if len(current) == 0:
            current.append(elt)
        else:
            if current[-1] == elt - 1:
                current.append(elt)
            else:
                consecutives.append(current)
                current = [elt]
    if len(current) != 0:
        consecutives.append(current)
    return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]

extract_proper_nouns(doc)

[OpTic Gaming, Duty Championship]
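
The grouping step above can also be written more compactly with itertools.groupby. The following is only a minimal sketch, not part of the original answer: the helper name group_proper_nouns is hypothetical, it works on plain (text, pos) pairs rather than on a spaCy Doc, and the sample tags are written by hand to mirror the output shown earlier rather than produced by running a model.

```python
from itertools import groupby


def group_proper_nouns(tagged):
    """Join runs of adjacent PROPN tokens into phrases.

    `tagged` is a list of (text, pos) pairs, which could be built from a
    spaCy doc with: [(tok.text, tok.pos_) for tok in doc]
    (hypothetical helper, for illustration only).
    """
    phrases = []
    # groupby splits the sequence into runs that share the same key,
    # here "is this token a proper noun?".
    for is_propn, run in groupby(tagged, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(text for text, _ in run))
    return phrases


# Hand-written tags for part of the example sentence (an assumption,
# not real model output).
sample = [
    ("one", "NUM"), ("of", "ADP"),
    ("OpTic", "PROPN"), ("Gaming", "PROPN"),
    ("'s", "PART"), ("main", "ADJ"), ("sponsors", "NOUN"),
]
print(group_proper_nouns(sample))  # ['OpTic Gaming']
```

Unlike extract_proper_nouns, this variant returns plain strings joined with single spaces, so it loses the Span objects and the original whitespace. Slicing the doc, as above, is preferable when you need character offsets or token attributes.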

For more details, see: https://spacy.io/usage/linguistic-features


0
import spacy

nlp = spacy.load("en_core_web_sm")
text = "New Delhi is a Capital of India"

doc = nlp(text)

full_entities = {}
for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG", "GPE"] and " " in ent.text:
        if ent.label_ not in full_entities:
            full_entities[ent.label_] = []
        full_entities[ent.label_].append(ent.text)

if not full_entities:
    proper_nouns = [token.text for token in doc if token.pos_ == "PROPN"]
    for i, token in enumerate(proper_nouns[:-1]):
        if proper_nouns[i+1].istitle() and not token.endswith("."):
            if "PERSON" not in full_entities:
                full_entities["PERSON"] = []
            full_entities["PERSON"].append(token + " " + proper_nouns[i+1])

print(full_entities)

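Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - moken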

Page content provided by Stack Overflow.