How to extract names from a string using NLTK

3

I am trying to extract (Indian) names from an unstructured string.

Here is my code:

import nltk

text = "Balaji Chandrasekaran Bangalore |  Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
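pronouns = []  # collects the chunked words (these are proper nouns, despite the variable name)
if text:
    # Chunk every single proper noun (NNP) as a "PERSON" candidate
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)

    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            pronouns.append(' '.join([c[0] for c in subtree]))

    print(pronouns)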

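['Balaji', 'Chandrasekaran', 'Bangalore', '|', 'Senior Business Analyst/ Lead Business Analyst', 'Successful Development Life Cycle SDLC', 'Agile', 'Business Requirements Analysis', 'Project Management', 'UML', 'Specification', 'UI', 'UX', 'Wireframe Designing', 'Functional Specification', 'Test Scenario Creation', 'SharePoint Admin', 'Work History', 'Senior Business Analyst', 'Aug', 'Current', 'Technology', 'Chennai', 'Translating CRM', 'Finance', 'Logistics', 'Operations', 'Intermodal', 'Education', 'Bachelor Engineering', 'Electronics Communication', 'Accomplishment', 'Mediterranean Shipping Company MSC', 'MSC Georgia', 'MSC Cambodia', 'MSC South', 'Successful Share', 'MSC Geneva Switzerland', 'Pre Study', 'General Design', 'Functional Specification', 'O', 'Business Analyst Forum', 'Business']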

But what I actually need is just Balaji Chandrasekaran. I even tried the Stanford NER library, but it fails to capture Balaji Chandrasekaran.

Can anyone help me extract names from unstructured strings, or suggest a good tutorial for doing this?

Thanks in advance.


1
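You'll probably need to find a library of non-Latin names transliterated into English. I'm not sure such a thing exists. - emporerblk
@emporerblk Do you mean something like corpus.names, but for Indian names? - Shamily Deenadayalan
Exactly. Python's names database hasn't been updated in a while (evidence), and the Stanford dictionary is based on Western names. If you want nltk to do what you want, you'll need to give it examples of Indian names. - emporerblk
@emporerblk Thanks a lot. Is there a tutorial on training on, or creating, a library of Indian names? - Shamily Deenadayalan
2 Answers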

1

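As I said in the comments, you need to create your own corpus of Indian names and test your text against it. The NLTK book shows how to do this in Chapter 2 (Section 1.9, to be exact).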

import nltk
from nltk.corpus import PlaintextCorpusReader

# You can use a regular expression to find the files, or pass a list of files
files = r".*\.txt"

# "/path/" is a placeholder for the directory that holds your corpus files
new_corpus = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(new_corpus.words())

See also: Creating a new corpus with NLTK
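Once you have such a name list, testing the text against it could look roughly like the sketch below. This is only an illustration: the `KNOWN_NAMES` set and the `find_names` helper are hypothetical, the tokenizer is a plain regular expression rather than nltk's, and in practice you would fill the set from your own corpus files.

```python
import re

# Hypothetical name list; in practice, load it from your own corpus of Indian names.
KNOWN_NAMES = {"balaji", "chandrasekaran", "priya", "raman"}


def find_names(text, known_names):
    """Return runs of consecutive tokens that all appear in known_names."""
    tokens = re.findall(r"[A-Za-z]+", text)  # crude tokenizer: alphabetic runs only
    names, current = [], []
    for token in tokens:
        if token.lower() in known_names:
            current.append(token)  # extend the current candidate name
        elif current:
            names.append(" ".join(current))  # a non-name token closes the run
            current = []
    if current:
        names.append(" ".join(current))
    return names


resume = "Balaji Chandrasekaran Bangalore | Senior Business Analyst/ Lead Business Analyst"
print(find_names(resume, KNOWN_NAMES))  # ['Balaji Chandrasekaran']
```

A lookup like this only finds names that are already in the list, so its quality depends entirely on how complete the list is.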


1
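Named entity recognition is not just a matter of looking up known names; a recognizer uses a combination of cues, including the form of the word and the structure of the text. The name you failed to recognize appears in a header, not in running text, so nltk's recognizer (which isn't very good anyway) cannot find it. See what happens if you use the name in a sentence: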
>>> import nltk
>>> text = "Balaji Chandrasekaran is a senior business analyst and lives in Bangalore."
>>> words = nltk.word_tokenize(text)
>>> print(nltk.ne_chunk(nltk.pos_tag(words)))
(S
  (PERSON Balaji/NNP)
  Chandrasekaran/NNP
  is/VBZ
  a/DT
  senior/JJ
  business/NN
  analyst/NN
  and/CC
  lives/NNS
  in/IN
  (GPE Bangalore/NNP)
  ./.)

It missed the last name (like I said, the recognizer isn't very good), but it was able to tell that there is a name here.
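If a partial hit like this is good enough as a starting point, one possible workaround is to extend each PERSON chunk with the proper nouns that directly follow it. The sketch below is a heuristic of my own, not something nltk does for you. To keep it runnable without nltk's models, the chunker output is written out by hand as plain Python data: `Chunk` is a hypothetical stand-in for an `nltk.Tree` subtree, and `person_names` is a hypothetical helper. With a real tree you would check `isinstance(node, nltk.Tree)` and call `node.label()` instead.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for an nltk.Tree subtree: a label plus (word, tag) leaves."""
    label: str
    leaves: list


def person_names(nodes):
    """Collect PERSON chunks, absorbing NNP tokens that immediately follow each one."""
    names = []
    i = 0
    while i < len(nodes):
        node = nodes[i]
        if isinstance(node, Chunk) and node.label == "PERSON":
            words = [word for word, _tag in node.leaves]
            i += 1
            # Absorb unchunked proper nouns right after the chunk (e.g. a missed surname).
            while i < len(nodes) and not isinstance(nodes[i], Chunk) and nodes[i][1] == "NNP":
                words.append(nodes[i][0])
                i += 1
            names.append(" ".join(words))
        else:
            i += 1
    return names


# The ne_chunk output shown above, written out by hand.
parsed = [
    Chunk("PERSON", [("Balaji", "NNP")]),
    ("Chandrasekaran", "NNP"), ("is", "VBZ"), ("a", "DT"), ("senior", "JJ"),
    ("business", "NN"), ("analyst", "NN"), ("and", "CC"), ("lives", "NNS"),
    ("in", "IN"), Chunk("GPE", [("Bangalore", "NNP")]), (".", "."),
]
print(person_names(parsed))  # ['Balaji Chandrasekaran']
```

This will over-extend whenever another proper noun happens to follow a name, which is exactly what happens in the résumé header ("Balaji Chandrasekaran Bangalore").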

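In other words: your problem is not mining text, it is mining résumés. The only good solution is to build and train a recognizer using pre-annotated résumés in the same format as the ones you need to process. This is not trivial: you need to annotate your training corpus, and work out which features (word forms and document-structure cues) your "feature extraction function" should put in the dictionary. Everything you need is described in various parts of Chapters 6 and 7 of the NLTK book.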
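To make the "feature extraction function" idea a little more concrete, here is a minimal sketch in the style of the NLTK book's Chapter 6: a function that maps one token to a dictionary of features. The specific features (capitalization, suffix, position near the top of the document, neighbors' capitalization) are assumptions of mine, not a tested recipe, and `token_features` is a hypothetical name.

```python
def token_features(tokens, i):
    """Return a feature dictionary for tokens[i] (illustrative features only)."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "is_title_case": word.istitle(),            # word-form cue
        "is_all_caps": word.isupper(),
        "suffix3": word[-3:].lower(),
        "near_start": i < 5,                        # structure cue: names tend to open a résumé
        "prev_is_title_case": prev_word.istitle(),
        "next_is_title_case": next_word.istitle(),
    }


tokens = "Balaji Chandrasekaran Bangalore | Senior Business Analyst".split()
for i in range(3):
    print(tokens[i], token_features(tokens, i))
```

Paired with labels from your annotated résumés, a list of `(features, label)` tuples like these is the kind of input that `nltk.NaiveBayesClassifier.train` accepts. Which features actually help is something only your own training data can tell you.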


Thanks. Yes, I'm working on this, and also using a name corpus. - Shamily Deenadayalan
Hi @alexis, can you help me extract the words tagged as a person's name? I mean, I only need balaji. - dataninsight
for x in data: if isinstance(x, nltk.Tree) and x.label() == 'PERSON': print(" ".join(w[0] for w in x)) - alexis
