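First, read these answers carefully; they contain parts of the answer you need and briefly explain what a classifier does and how it works in NLTK:
Testing the classifier on annotated data
Now to answer your question. We assume that your question is a follow-up to this question: Using my own corpus instead of movie_reviews corpus for Classification in NLTK
If your test text is structured the same way as the movie_reviews
corpus, then you can read the test data just as you would the training data:
In case the explanation of the code is unclear, here is a walkthrough: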
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
The two lines above read a directory named my_movie_reviews with the following structure:
\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README
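To make the role of the two regular expressions more concrete, here is a rough standard-library sketch of how file ids and labels fall out of that layout. It only approximates what the corpus reader does and does not use NLTK itself; the helper names list_labelled_files and build_toy_corpus, and the file contents, are made up for illustration.

```python
import os
import re
import tempfile

# Patterns copied from the CategorizedPlaintextCorpusReader call above.
FILE_PATTERN = r'(?!\.).*\.txt'
CAT_PATTERN = r'(neg|pos)/.*'

def list_labelled_files(root):
    """Return sorted (fileid, label) pairs; an approximation of the reader's view."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # File ids are paths relative to the corpus root, with forward slashes.
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            fileid = rel.replace(os.sep, '/')
            if not re.fullmatch(FILE_PATTERN, fileid):
                continue  # e.g. README has no .txt suffix, so it is skipped
            match = re.match(CAT_PATTERN, fileid)
            if match:
                pairs.append((fileid, match.group(1)))
    return sorted(pairs)

def build_toy_corpus(root):
    """Create the directory layout shown above (hypothetical contents)."""
    layout = {'pos': ['123.txt', '234.txt'], 'neg': ['456.txt', '789.txt']}
    for label, names in layout.items():
        os.makedirs(os.path.join(root, label), exist_ok=True)
        for name in names:
            with open(os.path.join(root, label, name), 'w') as fout:
                fout.write('placeholder review text')
    with open(os.path.join(root, 'README'), 'w') as fout:
        fout.write('not a review')

with tempfile.TemporaryDirectory() as tmp:
    build_toy_corpus(tmp)
    labelled = list_labelled_files(tmp)
    print(labelled)
```

The label is simply the first path component of the file id, which is why the later code can recover it with i.split('/')[0].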
The next line extracts the documents together with their pos/neg
tag, which is part of the directory structure.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
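Here is a step-by-step breakdown of the line above:
# Extract the pos/neg label from each file id
labels = [i.split('/')[0] for i in mr.fileids()]
# Read the words of one file (i is a file id) through the corpus reader
words = [w for w in mr.words(i)]
# Remove the stopwords
words = [w for w in mr.words(i) if w.lower() not in stop]
# Remove the punctuation
words = [w for w in mr.words(i) if w not in string.punctuation]
# Remove both the stopwords and the punctuation
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Do this for every file and pair the words with the pos/neg label
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]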
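The same filtering can be tried without any corpus on disk. The sketch below is only an illustration: toy_corpus stands in for mr.words(i), and toy_stop is a tiny made-up stop list rather than the NLTK one.

```python
import string

# Hypothetical stand-ins for stopwords.words('english') and the corpus reader.
toy_stop = {'the', 'a', 'is', 'this', 'was'}
toy_corpus = {
    'pos/123.txt': ['This', 'film', 'is', 'great', '!'],
    'neg/456.txt': ['The', 'plot', 'was', 'dull', '.'],
}

def clean(words, stop):
    """Drop stopwords (case-insensitively) and single punctuation tokens."""
    return [w for w in words if w.lower() not in stop and w not in string.punctuation]

# Pair the cleaned words with the label taken from the file id.
toy_documents = [(clean(words, toy_stop), fileid.split('/')[0])
                 for fileid, words in toy_corpus.items()]
print(toy_documents)
```

Note that w not in string.punctuation is a substring test against a single string, so a multi-character token such as '...' is not removed by it.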
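The SAME process should be applied when you read the test data!!!
Now to the feature processing:
The following lines give you the 100 most frequent words as features for the classifier: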
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]  # keys()[:100] only works on Python 2 / NLTK 2
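In NLTK 3, FreqDist is a subclass of collections.Counter, so the feature selection can be sketched with the standard library alone. The toy documents below are invented, and the cut-off of 3 stands in for the 100 used above.

```python
from collections import Counter
from itertools import chain

# Invented (tokens, label) pairs in the same shape as `documents`.
toy_documents = [
    (['great', 'film', 'great', 'cast'], 'pos'),
    (['dull', 'film', 'weak', 'plot'], 'neg'),
    (['great', 'plot'], 'pos'),
]

# Count every token across all documents, ignoring the labels.
counts = Counter(chain.from_iterable(tokens for tokens, _tag in toy_documents))

# Keep only the words themselves, most frequent first.
top_features = [word for word, _count in counts.most_common(3)]
print(top_features)
```

Using most_common makes the "top N by frequency" intent explicit, whereas slicing the keys depends on the ordering behaviour of a particular NLTK version.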
Next, convert the documents into a format the classifier can work with:
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
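Now to explain the long list comprehensions for train_set and test_set:
# Take the first numtrain documents as training documents
train_docs = documents[:numtrain]
# Take the rest of the documents as test documents
test_docs = documents[numtrain:]
# Build one {word: present?} feature dict per document and keep its label
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in train_docs]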
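Pulled out into a helper, the feature extraction looks like the sketch below. The name featurize and the toy data are assumptions for illustration; the dict comprehension itself is the one used above.

```python
def featurize(tokens, word_features):
    """Map each feature word to True/False depending on whether the document contains it."""
    token_set = set(tokens)  # set lookup avoids rescanning the token list per feature
    return {word: (word in token_set) for word in word_features}

# Invented feature list and documents, shaped like word_features and documents.
toy_word_features = ['great', 'film', 'plot']
toy_documents = [
    (['great', 'film', 'great', 'cast'], 'pos'),
    (['dull', 'film', 'weak', 'plot'], 'neg'),
]

toy_train_set = [(featurize(tokens, toy_word_features), tag)
                 for tokens, tag in toy_documents]
print(toy_train_set[0])
```

Every feature dict has exactly the same keys, whatever the document contains; that is why the test documents must be featurized with the word_features taken from the training data.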
You need to process the test documents in the same way for the feature extraction too!!!
So here is how you can read the test data:
stop = stopwords.words('english')
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
Then continue with the processing steps described above, and simply do the following to get the labels for your test documents, as @yvespeirsman answered:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier

stop = stopwords.words('english')
# Read and clean the training documents
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Use the 100 most frequent words as features
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
classifier = NaiveBayesClassifier.train(train_set)
# Read the test documents with exactly the same processing
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in test_documents]
# Compare each predicted label against the gold label
for doc, gold_label in test_set:
    tagged_label = classifier.classify(doc)
    if tagged_label == gold_label:
        print("Woohoo, correct")
    else:
        print("Boohoo, wrong")
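Instead of printing one line per document, the same comparison can be folded into a single accuracy figure. The sketch below uses a made-up KeywordClassifier that only mimics the classify(featureset) method of a trained NLTK classifier, plus invented test data, so it runs without NLTK.

```python
class KeywordClassifier:
    """Hypothetical stand-in with the same classify(featureset) interface."""
    def classify(self, featureset):
        return 'pos' if featureset.get('great') else 'neg'

def accuracy(classifier, labelled_featuresets):
    """Fraction of feature sets whose predicted label equals the gold label."""
    correct = sum(1 for feats, gold in labelled_featuresets
                  if classifier.classify(feats) == gold)
    return correct / len(labelled_featuresets)

# Invented (featureset, gold_label) pairs in the same shape as test_set.
toy_test_set = [
    ({'great': True, 'dull': False}, 'pos'),
    ({'great': False, 'dull': True}, 'neg'),
    ({'great': False, 'dull': False}, 'pos'),
    ({'great': True, 'dull': True}, 'neg'),
]

score = accuracy(KeywordClassifier(), toy_test_set)
print(score)
```

With a real classifier, nltk.classify.accuracy(classifier, test_set) computes the same ratio.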
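If the above code and explanation make no sense to you, then you MUST read this tutorial before proceeding: http://www.nltk.org/howto/classify.html
Now let's say you have no annotation in your test data, i.e. your test files are not in a pos/neg directory structure like movie_reviews but are just plain text files: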
\test_movie_reviews
    \1.txt
    \2.txt
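Then there is no point in reading them into a categorized corpus; you can simply read and tag the documents, e.g.:
import os
for infile in os.listdir('test_movie_reviews'):
    with open(os.path.join('test_movie_reviews', infile), 'r') as fin:
        for line in fin:
            doc = word_tokenize(line.lower())  # tokenize yourself; no corpus reader here
            tagged_label = classifier.classify({i: (i in doc) for i in word_features})
BUT you cannot evaluate the results without annotation, so you cannot check the predicted tag with the if-else comparison. Also, you need to tokenize the text yourself if you are not using CategorizedPlaintextCorpusReader. If you just want to tag a plain text file test.txt: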
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize
stop = stopwords.words('english')
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
classifier = NaiveBayesClassifier.train(train_set)
with open('test.txt', 'r') as fin:
    for test_sentence in fin:
        doc = word_tokenize(test_sentence.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
        print(tagged_label)
Once again, please don't just copy and paste the solution; try to understand why and how it works.
cv081.txt was classified as "pos", so what more is there to explain? - alexis