How can I build a POS-tagged corpus with NLTK?

Question

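I am trying to build a POS-tagged corpus from external .txt files, to use for chunking and for entity and relation extraction. So far I have only found a cumbersome multi-step solution: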

Read the files into a plain-text corpus:

from nltk.corpus.reader import PlaintextCorpusReader
my_corp = PlaintextCorpusReader(".", r".*\.txt")

Tag the corpus with the built-in Penn Treebank POS tagger:

my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())

(By the way, at this point Python raised an error: NameError: name 'batch' is not defined)
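
For readers on NLTK 3 or later: `batch_pos_tag` was renamed `pos_tag_sents` there, and tagger objects offer the same batch operation as `tag_sents`. The sketch below shows the batch call on a toy tagger trained on two made-up sentences, so that it runs without downloading a model. The training data and the names `toy_tagger` and `sents` are illustrative assumptions, not part of the question.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical toy training data; a real setup would use a pretrained model.
train_sents = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("a", "DT"), ("dog", "NN"), ("barked", "VBD")],
]

# Words the unigram model has not seen fall back to the default tag "NN".
toy_tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Tokenized sentences, in the same shape that my_corp.sents() yields.
sents = [["the", "dog", "sat"], ["a", "cat", "slept"]]

# tag_sents tags every sentence in one call, much as nltk.pos_tag_sents
# does with the default pretrained tagger.
tagged = toy_tagger.tag_sents(sents)
for sent in tagged:
    print(sent)
```

With a downloaded model, `nltk.pos_tag_sents(my_corp.sents())` would presumably take the place of the toy tagger here.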

Write the tagged sentences out to a file:

with open("output.txt", "w") as taggedfile:
    for sent in my_tagged_corp:
        # Join each (word, tag) pair as word/tag, one sentence per line
        line = " ".join(w + "/" + t for (w, t) in sent)
        taggedfile.write(line + "\n")

And finally, read this output back in as a tagged corpus:

from nltk.corpus.reader import TaggedCorpusReader
my_corpus2 = TaggedCorpusReader(".", r"output.txt")
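
The write-and-read-back steps can be checked end to end with a small round trip. This is a minimal sketch under stated assumptions: the two tagged sentences are made up, the corpus is written to a temporary directory rather than the current one, and `TaggedCorpusReader` is used with its default settings (whitespace-separated tokens, `/` between word and tag, one sentence per line).

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import tuple2str

# Hypothetical tagged sentences, standing in for the tagger's output.
tagged_sents = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("It", "PRP"), ("slept", "VBD")],
]

# Write one sentence per line in word/tag format.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "output.txt"), "w", encoding="utf8") as f:
    for sent in tagged_sents:
        # tuple2str renders ("cat", "NN") as "cat/NN"
        f.write(" ".join(tuple2str(pair) for pair in sent) + "\n")

# Read the file back as a tagged corpus using the reader's defaults.
reader = TaggedCorpusReader(corpus_dir, r"output\.txt")
round_trip = [list(sent) for sent in reader.tagged_sents()]
print(round_trip)
```

If the round trip preserves the data, `round_trip` should equal `tagged_sents`, which gives a quick sanity check on the file format before chunking.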

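This is all very inconvenient for such a common task (chunking always requires a tagged corpus). My question is: is there a more compact and elegant way to do this? For example, a corpus reader that takes the raw input files and a tagger at the same time?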

- Hendrik

At that point Python threw an error, so the code above never finished? Or how did you get the output? - patrick

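@hendrik: I am having trouble creating a POS-tagged corpus. I can run the Python code from your step 4, but I want to import my corpus from the nltk_data/corpora folder. Could you provide detailed steps for creating it? - Aditi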

1 Answer

Content provided by Stack Overflow.

- Aditi · Accepted Answer

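I have found a solution:

1. Refer to the link and follow the steps there.
2. Download the required files from here.
3. After running the commands from step 1, a pickle file is generated; this is your tagged corpus.
4. Once the pickle file has been generated, you can check that your tagger works by running the following code: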

import nltk.data
tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
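
The training command from the linked page is not reproduced here, so the following is only a sketch of the pickle round trip under stated assumptions: a toy `UnigramTagger` trained on one made-up sentence stands in for the real tagger, and the file name `toy_tagger.pickle` is hypothetical. It uses the standard `pickle` module directly, since a bare resource name passed to `nltk.data.load` is looked up on the NLTK data path.

```python
import os
import pickle
import tempfile

from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical training data; the real tagger would come from the
# training step described in the answer.
train_sents = [
    [("some", "DT"), ("words", "NNS"), ("in", "IN"),
     ("a", "DT"), ("sentence", "NN")],
]
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

# Save the trained tagger to a pickle file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "toy_tagger.pickle")
with open(path, "wb") as f:
    pickle.dump(tagger, f)

# Load it back and tag a tokenized sentence.
with open(path, "rb") as f:
    loaded = pickle.load(f)

words = ["some", "words", "in", "a", "sentence"]
print(loaded.tag(words))
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.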