Testing the NLTK classifier on specific files

The following code runs the Naive Bayes movie review classifier. The code generates a list of the most informative features.
Note: the **movie reviews** folder is in nltk.
import string
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

stop = stopwords.words('english')

documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]


word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)

The code link is from alvas.

How can I test the classifier on a specific file?

Please let me know if my question is ambiguous or wrong.

2 Answers


First, read these answers carefully; they contain parts of the answer you need and also briefly explain what a classifier does and how it works in NLTK:


Testing a classifier on annotated data

Now to answer your question. We assume your question is a follow-up of this question: Using my own corpus instead of movie_reviews corpus for classification in NLTK

If your test text is structured in the same way as the movie_review corpus, then you can simply read the test data as you would the training data:

In case the explanation of the code is unclear, here's a walkthrough:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

The two lines of code above read a directory called my_movie_reviews with the following structure:
\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README

The next line then extracts the documents with their pos/neg tags, which are part of the directory structure.

documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

Here's an explanation of the line above:
# This extracts the pos/neg tag
labels = [i.split('/')[0] for i in mr.fileids()]
# Reads the words from the corpus through the CategorizedPlaintextCorpusReader object
words = [w for w in mr.words(i)]
# Removes the stopwords
words = [w for w in mr.words(i) if w.lower() not in stop]
# Removes the punctuation
words = [w for w in mr.words(i) if w not in string.punctuation]
# Removes the stopwords and punctuations
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Removes the stopwords and punctuations and put them in a tuple with the pos/neg labels
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

When you read the test data, use exactly the same process!!!

Now on to the feature processing:

The following lines select the top 100 word features for the classifier:

# Extract the words features and put them into FreqDist
# object which records the no. of times each unique word occurs
word_features = FreqDist(chain(*[i for i,j in documents]))
# Cuts the FreqDist to the top 100 words in terms of their counts.
word_features = word_features.keys()[:100]
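
Side note: slicing .keys() like this only works on Python 2. On Python 3 dict views can't be sliced, so (assuming NLTK 3, where FreqDist provides most_common()) a sketch of the same step would be:

# Sketch for Python 3 / NLTK 3: most_common(100) returns (word, count) pairs
# sorted by frequency, so keep only the words.
word_features = [word for word, count in FreqDist(chain(*[i for i, j in documents])).most_common(100)]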

Then come the lines that process the documents into a classifiable format:

# Splits the training data into training size and testing size
numtrain = int(len(documents) * 90 / 100)
# Process the documents for training data
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
# Process the documents for testing data
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

Now to explain the long list comprehensions for train_set and test_set:
# Take the first `numtrain` no. of documents
# as training documents
train_docs = documents[:numtrain]
# Takes the rest of the documents as test documents.
test_docs = documents[numtrain:]
# These extract the feature sets for the classifier
# please look at the full explanation on https://dev59.com/VmEi5IYBdhLWcg3w6f6e
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in train_docs]
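
If the nested comprehension is hard to read, here is an equivalent explicit loop (a sketch that builds exactly the same train_set):

# Same as the one-liner above, written out step by step.
train_set = []
for tokens, tag in train_docs:
    # One boolean feature per word in word_features:
    # True if the word occurs in this document, False otherwise.
    featureset = {i: (i in tokens) for i in word_features}
    train_set.append((featureset, tag))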

You need to process your test documents in the same way for feature extraction!!!
So here's how to read the test data:
stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Now do the same for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]

Then simply continue with the processing steps described above, and do this to get the labels for the test documents, as @yvespeirsman answered:
#### FOR TRAINING DATA ####
import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier

stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Extract training features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Assuming that you're using full data set
# since your test set is different.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents]

#### TRAINS THE TAGGER ####
# Train the tagger
classifier = NaiveBayesClassifier.train(train_set)

#### FOR TESTING DATA ####
# Now do the same reading and processing for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
# Reads test data into features:
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in test_documents]

#### Evaluate the classifier ####
for doc, gold_label in test_set:
    tagged_label = classifier.classify(doc)
    if tagged_label == gold_label:
        print("Woohoo, correct")
    else:
        print("Boohoo, wrong")

If the code and explanation above make no sense to you, then you MUST read this tutorial before proceeding: http://www.nltk.org/howto/classify.html


Now let's say your test data has no annotations, i.e. your test.txt files are not in a directory structure like movie_review and are just plain text files:

\test_movie_reviews
    \1.txt
    \2.txt

Then there's no point reading it into a categorized corpus; you can simply read and tag the documents, e.g.:

# (needs: import os, plus word_tokenize / word_features / classifier from below)
for infile in os.listdir('test_movie_reviews'):
    for line in open(os.path.join('test_movie_reviews', infile), 'r'):
        doc = word_tokenize(line.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)

But without annotations you cannot evaluate the results, so you cannot check the predicted tag with an if-else. Also, if you're not using the CategorizedPlaintextCorpusReader, you need to tokenize the text yourself. If you just want to tag a plaintext file test.txt:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize

stop = stopwords.words('english')

# Extracts the documents.
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
# Extract the features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Converts documents to features.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
# Train the classifier.
classifier = NaiveBayesClassifier.train(train_set)

# Tag the test file.
with open('test.txt', 'r') as fin:
    for test_sentence in fin:
        # Tokenize the line.
        doc = word_tokenize(test_sentence.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
        print(tagged_label)
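
The loop above tags test.txt line by line. If you want a single label for the whole file instead, a sketch under the same assumptions (the classifier and word_features defined above) would be:

# Variant: treat the whole of test.txt as one document and classify it once.
with open('test.txt', 'r') as fin:
    doc = word_tokenize(fin.read().lower())
featurized_doc = {i:(i in doc) for i in word_features}
print(classifier.classify(featurized_doc))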

Once again, please don't just copy and paste the solution; try to understand why and how it works.


Thank you for the detailed explanation; I'll try my best to understand it. But I often get the wrong result. I mean, it should be 'pos' but the program shows 'neg', and I don't know why. - ZaM
There are many reasons, and it's not perfect; it could be (i) not enough data, (ii) features that aren't good enough, (iii) the choice of classifier, etc. Do take https://www.coursera.org/course/ml for more info. And if you can, I strongly encourage you to attend http://lxmls.it.pt/2015/ - alvas
You judge the quality of the output by evaluating how often it is correct. The classifier learns which features are useful and how to combine them to make its decision. There's no logical rule; it's all statistics and weighting. If your feature set makes the file cv081.txt come out as 'pos', what else is there to explain? - alexis
Go through the machine learning course at the Coursera link and you will understand why and how the classifier works. I started out treating them as black boxes, and once you understand how they produce the annotations, it's easier to code and to appreciate their elegance. - alvas
The first case is when you have annotated data to test on; the second is when you have none. If you need us to verify the code's output, could you post the full dataset somewhere so that we can test it in our free time? - alvas

You can test on one file with the classifier.classify() method. This method takes as its input a dictionary with the features as keys and True or False as values, depending on whether the feature occurs in the document or not. It outputs the most probable label for the file, according to the classifier. You can then compare this label with the correct label for the file to see whether the classification is correct.
In your training and test sets, the feature dictionary is always the first item in each tuple, and the label is the second item.
Thus, you can classify the first document in your test set like this:
(my_document, my_label) = test_set[0]
if classifier.classify(my_document) == my_label:
    print "correct!"
else:
    print "incorrect!"

Could you please give me a complete example, and if possible, based on my example in the question? I'm very new to Python. Could you tell me why you write 0 in test_set[0]? - ZaM
This is a complete example: if you paste the code right after the code in your question, it will work. The 0 simply takes the first document in your test set (the first item in a list has index 0). - yvespeirsman
Thank you very much. Is there a way to write a name_of_file instead of 0 in test_set[0]? I don't know which file test_set actually refers to, since we have two folders pos|neg and each folder has its own files. I ask because 'bad' is the 'most informative' word (a result of my question above). The first file has more than a hundred 'bad' words, but the program shows 'incorrect' in the output. Where is my mistake? - ZaM
First, test_set doesn't contain the filenames, so if you want to use it to identify a file, one way would be to read the file directly and pass it to the classifier as the feature dictionary I described above. Second, your current classifier uses binary features: it only checks whether a word occurs in a document or not, but ignores how often the word occurs. That's probably why it misclassifies a file with many occurrences of 'bad'. - yvespeirsman
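
For reference, one possible minimal sketch along the lines yvespeirsman describes, i.e. classifying one particular movie_reviews file by its fileid and comparing the prediction with the folder label (it assumes classifier, word_features, stop, string and movie_reviews from the question's code):

# Pick any fileid from movie_reviews.fileids(); here, the first positive review.
fileid = movie_reviews.fileids('pos')[0]
tokens = [w for w in movie_reviews.words(fileid)
          if w.lower() not in stop and w.lower() not in string.punctuation]
featurized_doc = {i: (i in tokens) for i in word_features}
predicted = classifier.classify(featurized_doc)
gold = fileid.split('/')[0]   # 'pos' or 'neg', taken from the directory name
print(predicted, gold, predicted == gold)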
