How do I read nltk.text.Text files from nltk.book in Python?

Question

3

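I'm learning natural language processing with nltk. It can do a lot, but I can't find a way to read the texts that ship with the package. I have tried things like the following: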

from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()

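But that doesn't seem to work, because there is no fileid. Nobody seems to have asked this before, so I assume the answer should be easy. Do you know how to read these texts or convert them to a string? Thanks.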

- Juan C

Nice find! Ah, there's a bit of a gap in the documentation =) - alvas

3 Answers

2

It looks like they have already split it into tokens for you.

from nltk.book import text6

text6.tokens
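
If a single string is needed rather than a token list, the tokens can be joined back together. This is only a rough sketch: `join_tokens` is a hypothetical helper, not part of NLTK, and because the original whitespace is lost during tokenization, the result only approximates the source text.

```python
import string


def join_tokens(tokens):
    """Join tokens with spaces, gluing punctuation-only tokens to the previous word."""
    pieces = []
    for token in tokens:
        # A token made up only of punctuation (",", "!", ":") is attached
        # to the word before it instead of being separated by a space.
        if pieces and all(ch in string.punctuation for ch in token):
            pieces[-1] += token
        else:
            pieces.append(token)
    return " ".join(pieces)


# Hand-written sample tokens (an assumption, used so the sketch runs without NLTK data)
sample = ["KING", "ARTHUR", ":", "Whoa", "there", "!"]
print(join_tokens(sample))  # KING ARTHUR: Whoa there!
```

With NLTK and its book data installed, `join_tokens(text6.tokens)` should work the same way, since `text6.tokens` is a plain list of strings. Opening brackets and quotes are also glued to the preceding word, so the spacing is not faithful everywhere.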

- Jon

0

from nltk.book import text6

# Print the sorted set of unique tokens
print(sorted(set(text6)))

- Johnny


- alvas · Accepted Answer

Let's dig into the code =)

First, the nltk.book code lives at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look closely, the texts are loaded as nltk.Text objects, e.g. for text6, which is loaded at https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36:

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")

The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286, and you can read more about how to use it at http://www.nltk.org/book/ch02.html. webtext is a corpus from nltk.corpus, so to get the raw text of nltk.book.text6 you can load webtext directly, e.g.:

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')

The fileids are only available when you load the PlaintextCorpusReader object, not from the Text object (which has already been processed):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
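
Building on the listing above, the raw text of every file in a corpus can be collected into a dictionary keyed by file ID. This is a minimal sketch that assumes only that the reader exposes `fileids()` and `raw(fileid)`, as `webtext` does. `raw_texts` and the stand-in `FakeReader` are hypothetical names, included so the example runs without the corpus data installed.

```python
def raw_texts(reader):
    """Map each file ID in a corpus reader to its raw text as one string."""
    return {fileid: reader.raw(fileid) for fileid in reader.fileids()}


class FakeReader:
    """Stand-in for a corpus reader such as webtext (illustration only)."""

    def __init__(self, files):
        self._files = files

    def fileids(self):
        return sorted(self._files)

    def raw(self, fileid):
        return self._files[fileid]


# Made-up file contents; with NLTK installed, pass webtext instead of FakeReader
reader = FakeReader({"grail.txt": "SCENE 1: [wind]", "wine.txt": "A lovely wine."})
texts = raw_texts(reader)
print(type(texts["grail.txt"]).__name__, len(texts))  # str 2
```

With NLTK and the webtext corpus available, `raw_texts(webtext)["grail.txt"]` should give the same string as `webtext.raw('grail.txt')`.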