Extracting all nouns from a text file using nltk

Question

22

Is there a more efficient way of doing this? My code reads a text file and extracts all the nouns.

import nltk

with open(fileName) as f:  # fileName: path to the input text file
    lines = f.read()  # read the whole file into one string
sentences = nltk.sent_tokenize(lines)  # split the text into sentences
nouns = []  # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
            nouns.append(word)

How do I reduce the time complexity of this code? Is there a way to avoid using the nested for loops?

Thanks in advance!

- Rakesh Adhikesavan

Replace the if condition with if pos.startswith('NN'): and use a set or collections.Counter rather than keeping a list. Also do some map/reduce instead of a list comprehension. Otherwise, try "shallow parsing", also known as "chunking". - alvas
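A minimal sketch of the suggestion in this comment, assuming the tokens are already tagged with Penn Treebank tags. The helper name count_nouns and the hand-tagged sample are illustrative and not part of the original post; in practice the pairs would come from nltk.pos_tag.

```python
from collections import Counter

def count_nouns(tagged_tokens):
    """Count the words whose Penn Treebank tag starts with 'NN'."""
    return Counter(word for word, pos in tagged_tokens if pos.startswith('NN'))

# Hand-tagged sample so the sketch runs without any NLTK data installed.
sample = [('lines', 'NNS'), ('is', 'VBZ'), ('some', 'DT'),
          ('string', 'NN'), ('of', 'IN'), ('words', 'NNS'),
          ('words', 'NNS')]

noun_counts = count_nouns(sample)
print(noun_counts.most_common())
```

A Counter keeps one entry per distinct noun plus its frequency, so memory grows with the vocabulary rather than with the length of the text.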

6 Answers

30

import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 

print(nouns)
# ['lines', 'string', 'words']

Useful tip: list comprehensions are often faster than adding elements to a list one at a time with .insert() or .append() inside a for loop.

- Boa
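The tip about list comprehensions can be checked with a quick timing. This is only a sketch: the hand-tagged list stands in for the output of nltk.pos_tag, the function names are made up for illustration, and the numbers will vary by machine.

```python
import timeit

# Hand-tagged (word, tag) pairs, repeated to get a measurable workload.
tagged = [('lines', 'NNS'), ('is', 'VBZ'), ('some', 'DT'),
          ('string', 'NN'), ('of', 'IN'), ('words', 'NNS')] * 1000

def with_append():
    nouns = []
    for word, pos in tagged:
        if pos[:2] == 'NN':
            nouns.append(word)
    return nouns

def with_comprehension():
    return [word for word, pos in tagged if pos[:2] == 'NN']

# Both variants must build exactly the same list.
assert with_append() == with_comprehension()

for fn in (with_append, with_comprehension):
    seconds = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {seconds:.3f} s for 200 runs")
```

Whatever the difference turns out to be, it only covers building the list; the tagging step itself is likely to dominate the total run time.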

1

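The answer has the right idea. Make it more succinct with this: is_noun = lambda pos: True if pos[:2] == 'NN' else False. Note: list comprehensions are not necessarily faster than for loops. The benefit is that you don't have to materialize a list and handle the nested loop yourself, and you can write it as a generator instead of a list. - alvas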

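@alvas - I didn't use something like ... pos[:2] == 'NN' ... because it could match unwanted strings. For all I know, there could be a pos value of 'NNA' that we don't want to match. Strictly speaking, the True if and else False parts aren't necessary either, but I included them for clarity. Good point about list comprehensions not necessarily being faster than loops (I guess I was being a bit flippant); I've edited the post accordingly. - Boa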

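Just out of curiosity, can you give an example of 'NNA'? Then we could run some checks on NLTK for other things unrelated to this question =). Technically, there shouldn't be any tags outside this tagset: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. - alvas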

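@alvas - The scenario I raised is hypothetical. My point is that I don't know in advance what values the 'pos' variable might take (maybe I should have said 'NNABCDEFG' instead of 'NNA' to make that clearer), so to be safe I went with the condition given in the original question. That conditional, like any other part of what I provided, can be modified as needed; I suspect the performance difference between the 'pos[:2]' variant and the long condition I proposed is quite small. - Boa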

@alvas - OK, I've edited the post to include your suggestion and make the answer clearer. Cheers ;) - Boa
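The two positions in this comment thread can be put side by side. The sketch below is illustrative only: the function names are made up, and 'NNA' is the hypothetical tag from the comments, not a real Penn Treebank tag.

```python
# The four Penn Treebank noun tags listed in the original question.
NOUN_TAGS = frozenset({'NN', 'NNS', 'NNP', 'NNPS'})

def is_noun_exact(pos):
    """Strict check: accept only the four known noun tags."""
    return pos in NOUN_TAGS

def is_noun_prefix(pos):
    """Loose check: accept any tag that starts with 'NN'."""
    return pos.startswith('NN')

# 'NNA' is the hypothetical tag from the discussion above.
for tag in ['NN', 'NNPS', 'VB', 'NNA']:
    print(tag, is_noun_exact(tag), is_noun_prefix(tag))
```

The two checks agree on every tag in the Penn Treebank tagset and differ only on tags outside it, so the choice is about defensiveness rather than speed.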

17

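You can get good results using nltk, TextBlob, spaCy or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.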

import nltk
from textblob import TextBlob
import spacy
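nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut is no longer accepted by spaCy 3
nlp1 = spacy.load('en_core_web_lg')  # larger model, not used in the timings shown here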

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

I ran some comparisons in a Jupyter Notebook on my Windows 10 HP laptop (i5, 2 cores, 4 logical processors, 8 GB RAM) and got the following results.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])

Output:

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations

For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])

Output:

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations

For spaCy:

%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])

Output:

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations

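It seems nltk and TextBlob are faster, which is to be expected since they store nothing else about the input text, txt. spaCy is much slower. One more thing: spaCy missed the noun NLP, while nltk and TextBlob got it. Unless I need to extract something else from the input txt, I would go with nltk or TextBlob. For more, see the spaCy quickstart, the TextBlob basics, and the NLTK HowTos.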

- Samuel Nde
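The "average over 20 iterations" figures above can be reproduced outside Jupyter with the standard timeit module. This is a sketch with a hypothetical helper name (average_ms) and a stand-in workload; swap in the TextBlob, nltk or spaCy expression to compare them.

```python
import timeit

def average_ms(func, iterations=20):
    """Return the mean wall-clock time of func() in milliseconds."""
    total_seconds = timeit.timeit(func, number=iterations)
    return total_seconds / iterations * 1000

# Stand-in workload; replace it with e.g.
#   lambda: nltk.pos_tag(nltk.word_tokenize(txt))
def workload():
    return [w for w in "a field of computer science".split() if len(w) > 2]

print(f"{average_ms(workload):.4f} ms per call")
```

Calling each library once before timing keeps lazy model loading out of the measurement.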

2

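spaCy missed NLP because it tags it as a proper noun (PROPN), and the filter above keeps only NOUN. spaCy has more features but is slower; you can disable the syntactic parser to speed it up. - MrE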
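Both points in this comment can be sketched as follows. The function names are made up for illustration, the model name en_core_web_sm is an assumption, and the stand-in tokens only mimic the text and pos_ attributes of spaCy tokens so that the filter can be run without spaCy installed.

```python
from types import SimpleNamespace

def load_tagger_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline without the parser and NER components."""
    import spacy  # imported here so the filter below runs without spaCy
    return spacy.load(model_name, disable=["parser", "ner"])

def nouns_from_doc(doc, include_proper=True):
    """Return token texts tagged NOUN, and optionally PROPN as well."""
    wanted = {"NOUN", "PROPN"} if include_proper else {"NOUN"}
    return [token.text for token in doc if token.pos_ in wanted]

# Stand-in tokens (not real spaCy objects) to demonstrate the filter.
fake_doc = [SimpleNamespace(text="NLP", pos_="PROPN"),
            SimpleNamespace(text="is", pos_="AUX"),
            SimpleNamespace(text="field", pos_="NOUN")]
print(nouns_from_doc(fake_doc))

# With spaCy and the model installed:
#   nlp = load_tagger_pipeline()
#   print(nouns_from_doc(nlp(txt)))
```

Including PROPN brings the result closer to the NN* filter used with nltk and TextBlob, which also matches the proper-noun tags NNP and NNPS.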

5

import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos[:2] == 'NN']
print(nouns)

Just simplified a bit more.

- Amit Ghosh

4

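I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to get better than quadratic time complexity in these outer loops.

Recent versions of NLTK have a built-in function that does what you're doing by hand, nltk.tag.pos_tag_sents, and it returns lists of tagged words too.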

- Will Angley
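A sketch of how nltk.pos_tag_sents could replace the per-sentence pos_tag calls. The function names are made up for illustration, and extract_nouns assumes NLTK plus its tokenizer and tagger data are installed; the hand-tagged sample lets the filtering step run without them.

```python
from itertools import chain

def nouns_from_tagged_sents(tagged_sents):
    """Flatten a list of tagged sentences and keep the NN* words."""
    return [word for word, pos in chain.from_iterable(tagged_sents)
            if pos.startswith('NN')]

def extract_nouns(text):
    """Tokenize, tag all sentences in one call, then filter the nouns."""
    import nltk  # requires the NLTK tokenizer and tagger data
    tokenized = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    return nouns_from_tagged_sents(nltk.pos_tag_sents(tokenized))

# Hand-tagged stand-in for the output of nltk.pos_tag_sents.
sample = [[('Dogs', 'NNS'), ('bark', 'VBP'), ('.', '.')],
          [('Rome', 'NNP'), ('is', 'VBZ'), ('old', 'JJ'), ('.', '.')]]
print(nouns_from_tagged_sents(sample))
```

Every token is still visited once, so this tidies the code rather than changing its complexity.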

4

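Your code has no redundancy: you read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.

The only potential for improvement is in space complexity: instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time; I wouldn't bother unless your files are several gigabytes in size. For small files it's not going to make any difference.

In short, your loops are fine. There are a thing or two in your code that you could clean up (e.g. the if clause that matches the POS tags), but it's not going to change anything efficiency-wise.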

- alexis
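One way to read a file in increments, as this answer describes, is to stream it paragraph by paragraph. This is a sketch under a loud assumption: sentences never span a blank line, so each paragraph can be sentence-tokenized and tagged on its own. The name iter_paragraphs is made up for illustration.

```python
import io

def iter_paragraphs(handle):
    """Yield blank-line-separated paragraphs one at a time."""
    buffer = []
    for line in handle:
        stripped = line.strip()
        if stripped:
            buffer.append(stripped)
        elif buffer:
            # A blank line closes the current paragraph.
            yield ' '.join(buffer)
            buffer = []
    if buffer:
        yield ' '.join(buffer)

# io.StringIO stands in for an open text file.
fake_file = io.StringIO("First sentence. It wraps\nonto a second line.\n\n"
                        "Second paragraph.\n")
for paragraph in iter_paragraphs(fake_file):
    print(paragraph)
```

Only one paragraph is held in memory at a time, and each one can be passed to the sentence tokenizer and tagger in turn.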


- Aziz Alto · Accepted Answer

If you are not limited to NLTK, you can try TextBlob. It easily extracts all the nouns and noun phrases:

>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']