Python - Searching text with NLTK

3
I have a text file (you can download it here).
I am trying to search the file for the word language. To do that, I have the following Python script:
import nltk

file = open('NLTK.txt', 'r')
read_file = file.read()
text = nltk.Text(read_file)
match = text.concordance('language')
print(match)

When I run the program, I get the following output even though the file contains the word language:
No matches
None

Why can't the program find the word "language", which is present in the file? Note that the statement text = nltk.Text(read_file) returns:
<Text: T h i s   i s  ...>

Thanks.

1
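The accepted answer is right about how to fix the problem, but here is another suggestion: don't bother learning how to use the "Text" class; it is designed only for interactive exploration and demos. Use "PlaintextCorpusReader" directly (and its counterparts for annotated formats). - alexis
1 Answer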

6

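I believe you need to tokenize the raw text first (as described in Chapter 3). nltk.Text expects a sequence of tokens; when it is given a raw string it treats every character as a separate token, which is why the repr shows <Text: T h i s ...> and the search for a whole word finds nothing. Tokenizing first and then building the Text gives results for your sample text.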

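import nltk

# Read the whole file as one string; the context manager closes the file.
with open('NLTK.txt', 'r') as file:
    read_file = file.read()

# nltk.Text expects a sequence of tokens, not a raw string.
# word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.
text = nltk.Text(nltk.word_tokenize(read_file))

# concordance() prints its results and returns None, so nothing is assigned.
text.concordance('language')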
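To illustrate why the original script reported "No matches", here is a minimal sketch in plain Python. It assumes a made-up sample sentence and uses str.split() as a crude stand-in for nltk.word_tokenize, only so that it runs without NLTK or its data files.

```python
# Hypothetical sample text; the real script reads NLTK.txt instead.
raw = "This is a language test"

# Iterating over a string yields single characters, which is what
# nltk.Text receives when it is handed the raw file contents.
chars = list(raw)

# Whitespace splitting is only a rough substitute for word_tokenize.
tokens = raw.split()

print(chars[:4])             # ['T', 'h', 'i', 's']
print('language' in chars)   # False
print('language' in tokens)  # True
```

A whole word can never equal a single character, so the lookup only succeeds once the text has been split into tokens.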

Alternatively, you can use an NLTK corpus reader to do the tokenizing and processing, as follows:

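import nltk
from nltk.corpus import PlaintextCorpusReader

# First argument is the corpus root directory, second is the file to load.
corp = PlaintextCorpusReader(r'C:/', 'NLTK.txt')
# corp.words() yields tokens, so no separate tokenizing step is needed.
text = nltk.Text(corp.words())

# concordance() prints the matches itself and returns None.
text.concordance('language')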

Matching results:

Displaying 18 of 18 matches:
                                   Language Processing . By `` natural languag
                                   language '' we mean a language that is used
                                   language that is used for everyday communic
licit rules . We will take Natural Language Processing ・or NLP for short ・in a
f computer manipulation of natural language . At one extreme , it could be as
ted access to stored information , language processing has come to play a cent
e textbook for a course on natural language processing or computational lingui
is based on the Python programming language together with an open source libra
 source library called the Natural Language Toolkit ( NLTK ) . NLTK includes e
s are deployed in a variety of new language technologies . For this reason it
rite programs that analyze written language , regardless of previous programmi
is book to get immersed in natural language processing . All relevant Python f
ty for this application area . The language index will help you locate relevan
mples and dig into the interesting language analysis material that starts in 1
 text using Python and the Natural Language Toolkit . To learn about advanced
an help you manipulate and analyze language data , and how to write these prog
s are used to describe and analyse language How data structures and algorithms
and algorithms are used in NLP How language data is stored in standard formats
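Because concordance() only prints its results, print(match) in the question shows None. If the matches are needed as data, one option is to collect them yourself. The following is a rough sketch, not NLTK's implementation: simple_concordance is a hypothetical helper, the sample sentence is made up, and the context width is counted in tokens rather than characters.

```python
def simple_concordance(tokens, word, width=3):
    """Return (left, match, right) tuples for each occurrence of word.

    Hypothetical helper, not part of NLTK. Matching is case-insensitive,
    and width is the number of context tokens kept on each side.
    """
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Made-up sample; in the answer above the tokens come from NLTK.
sample = "Natural Language Processing deals with natural language text".split()
matches = simple_concordance(sample, 'language', width=2)

for left, tok, right in matches:
    print(f"{left:>20} {tok} {right}")
```

Recent NLTK releases also provide Text.concordance_list, which returns the matches instead of printing them; check the documentation of your installed version before relying on it.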

Which of the two is faster? - Arnav Das
