EDIT: A runnable, corrected version of the code is at the following link: https://github.com/a7x/NaiveBayes-Classifier
I used the data from openClassroom and started developing a small Naive Bayes classifier in Python. The steps are the usual ones: train, then predict. I have a few questions, and I would like to know why my accuracy is quite poor.
For training, I calculated the log likelihood with the formula:

log( (count(word | spam) + 1) / (spamSize + vocabSize) )
My question is: why did we add the vocabSize in this case :( and is this the correct way of going about it? The code used is below:

#This is for training. Calculate all probabilities and store them in a vector.
#Better to store it in a file for easier access
from __future__ import division
import sys, os
'''
1. The spam and non-spam priors are already 50%, so by default they are 0.5
2. Now we need to calculate the probability of each word, in spam and non-spam separately
  2.1 We can make two dictionaries, defaultdicts basically, for spam and non-spam
  2.2 When the time comes to calculate probabilities, we just need to substitute values
'''
from collections import defaultdict
from math import log

spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1]                      #Base path
spamVector = open(sys.argv[2], 'w')     #Write all spam values into this
nonspamVector = open(sys.argv[3], 'w')  #Non-spam values

#Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path, spamFolders[0])):
    data = open(os.path.join(path, spamFolders[0], f), 'r')
    for line in data:
        words = line.split(" ")
        spamSize += len(words)
        for w in words:
            spamDict[w] += 1
    data.close()
for f in os.listdir(os.path.join(path, nonspamFolders[0])):
    data = open(os.path.join(path, nonspamFolders[0], f), 'r')
    for line in data:
        words = line.split(" ")
        nonspamSize += len(words)
        for w in words:
            nonspamDict[w] += 1
    data.close()

#These store the log probabilities
logProbspam = {}
logProbnonSpam = {}
for k in spamDict.keys():
    #Need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1  #frequency, plus one for smoothing
    print 'Word', k, ' frequency', spamDict[k]
    denominator = spamSize + vocabSize
    logProbspam[k] = log(numerator / denominator)
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1  #frequency, plus one for smoothing
    denominator = nonspamSize + vocabSize
    logProbnonSpam[k] = log(numerator / denominator)

for k in logProbnonSpam.keys():
    nonspamVector.write(k + " " + str(logProbnonSpam[k]) + "\n")
for k in logProbspam.keys():
    spamVector.write(k + " " + str(logProbspam[k]) + "\n")
spamVector.close()
nonspamVector.close()
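As I understand it, vocabSize appears in the denominator because the +1 in the numerator (Laplace smoothing) adds vocabSize extra counts in total, so the denominator must grow by the same amount for the smoothed probabilities to still sum to 1. A minimal Python 3 sketch of that calculation (the function name and toy data here are mine, not from the repo):

```python
from collections import defaultdict
from math import log

def train_log_probs(word_counts, class_size, vocab_size):
    # Laplace-smoothed log likelihoods: log((count + 1) / (class_size + vocab_size)).
    # Summing (count_w + 1) over all vocab_size words gives class_size + vocab_size,
    # so the smoothed probabilities form a proper distribution.
    denominator = class_size + vocab_size
    return {w: log((c + 1) / denominator) for w, c in word_counts.items()}

counts = defaultdict(int)
for w in "buy cheap pills buy now".split():
    counts[w] += 1
# 5 tokens seen, pretend vocabulary of 10 words
log_probs = train_log_probs(counts, class_size=5, vocab_size=10)
```

With these toy numbers, "buy" (seen twice) gets probability (2+1)/(5+10) = 3/15, and the seen and unseen word probabilities together sum to exactly 1.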
For prediction, I just took a mail, split it into words, added up all the probabilities separately for spam and non-spam, and multiplied each by 0.5 (the class prior). Whichever was higher became the class label. The code is below:
http://pastebin.com/8Y6Gm2my (Stack Overflow was again playing games for some reason :-/)
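The shape of that prediction step, sketched in Python 3 (function and variable names are illustrative, not the pastebin code; note it works in log space, adding log(0.5) rather than multiplying, and gives unseen words the smoothed floor probability instead of skipping them):

```python
from math import log

def classify(words, log_probs_spam, log_probs_ham, spam_size, ham_size, vocab_size):
    # Score each class in log space: log prior + sum of per-word log likelihoods.
    # A word never seen in a class gets log(1 / (class_size + vocab_size)),
    # i.e. the Laplace-smoothed estimate for a zero count, so both classes
    # are penalized consistently for unknown words.
    score_spam = log(0.5)
    score_ham = log(0.5)
    for w in words:
        score_spam += log_probs_spam.get(w, log(1 / (spam_size + vocab_size)))
        score_ham += log_probs_ham.get(w, log(1 / (ham_size + vocab_size)))
    return "spam" if score_spam > score_ham else "ham"

# Toy tables standing in for the vectors written out during training
spam_lp = {"cheap": log(3 / 10), "pills": log(3 / 10)}
ham_lp = {"meeting": log(3 / 10)}
label = classify("cheap pills".split(), spam_lp, ham_lp, 5, 5, 5)
```

Summing log probabilities avoids the numeric underflow you get from multiplying many probabilities below 1.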
EDIT: I have removed the spam = spam + 1 part. Instead of that, I just ignore those words.
Problem: My accuracy is very low, as noted below.
No. of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53, no. of non-spam is 77
No. of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6, no. of non-spam is 124
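Working those counts out (assuming every file in spam-test is truly spam and every file in nonspam-test is truly non-spam), it is the spam class in particular that falls below 50%:

```python
# Counts from the test runs above
spam_total, spam_as_spam = 130, 53         # spam-test files classified as spam
nonspam_total, nonspam_as_nonspam = 130, 124  # nonspam-test files classified as non-spam

spam_recall = spam_as_spam / spam_total                # ~0.41, below 50%
nonspam_recall = nonspam_as_nonspam / nonspam_total    # ~0.95
accuracy = (spam_as_spam + nonspam_as_nonspam) / (spam_total + nonspam_total)  # ~0.68
```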
Please tell me where my mistake is. I feel that accuracy this far below 50% on the spam class must mean there is some glaring error in the implementation.