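I'm trying to create a multiprocessing version of some text classification code. I found the full code here (along with other cool stuff), and I've attached it below.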
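I've tried a couple of things. First I tried a lambda function, but it complained that it couldn't be pickled, so I then tried a stripped-down version of the original code: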
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
p = Pool(2)
negfeats = []
posfeats = []
for f in negids:
    words = movie_reviews.words(fileids=[f])
    negfeats = p.map(featx, words)  # not same form as below - using for debugging
print len(negfeats)
Unfortunately, even this doesn't work. I get the following traceback:
  File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division
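Any ideas what I might be doing wrong? Should I be using pool.apply_async instead? (Simply switching to it doesn't seem to solve the problem either, but perhaps I'm barking up the wrong tree.)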
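On the lambda attempt: a Pool sends the callable to its worker processes by pickling it, and that applies to apply_async as well as map, so switching between the two would not be expected to lift the restriction. The sketch below reproduces the pickling difference on its own. It is only an illustration: `word_feats`, `lambda_feats`, and `can_pickle` are hypothetical names, and `word_feats` is an assumed stand-in for whatever `featx` actually is.

```python
import pickle


def word_feats(words):
    # Hypothetical module-level stand-in for featx: bag-of-words features.
    return dict((word, True) for word in words)


# Same logic as a lambda. Its internal name is "<lambda>", which pickle
# cannot look up again by name in the defining module.
lambda_feats = lambda words: dict((word, True) for word in words)


def can_pickle(obj):
    # Approximates the check a Pool has to pass before it can hand
    # a callable to a worker process.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print(can_pickle(word_feats))    # named, module-level function
    print(can_pickle(lambda_feats))  # lambda
```

Run as a script, the named function should pickle and the lambda should not, which is why moving the logic into a plain `def` at module level is the usual workaround.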
import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    # featx is called once per file, on that file's full word list
    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # 3/4 train, 1/4 test (relies on Python 2 integer division)
    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    # collect reference and predicted labels for precision/recall
    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()
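
One thing worth noting about granularity: Pool.map calls its function once per element of the iterable, so `p.map(featx, words)` hands `featx` a single word at a time, whereas the list comprehensions in `evaluate_classifier` call `featx` once per file on the whole word list. Whether that accounts for the ZeroDivisionError depends on what `featx` does with a single string, which isn't shown here. Below is a minimal sketch of mapping per file instead. `TOY_CORPUS`, `word_feats`, and `feats_for_file` are hypothetical stand-ins (for `movie_reviews` and `featx`), not part of the original code.

```python
from multiprocessing import Pool

# Hypothetical stand-in for movie_reviews: file id -> list of words.
TOY_CORPUS = {
    "neg/a.txt": ["dull", "plot"],
    "neg/b.txt": ["bad", "acting", "bad", "script"],
    "pos/c.txt": ["great", "cast"],
}


def word_feats(words):
    # Hypothetical stand-in for featx: bag-of-words features.
    return dict((word, True) for word in words)


def feats_for_file(fileid):
    # One task per file: the worker receives a file id, looks up the
    # whole word list itself, and returns a (features, label) pair.
    label = fileid.split("/")[0]
    return (word_feats(TOY_CORPUS[fileid]), label)


if __name__ == "__main__":
    fileids = sorted(TOY_CORPUS)
    pool = Pool(2)
    try:
        # map preserves input order, so results line up with fileids
        labelled = pool.map(feats_for_file, fileids)
    finally:
        pool.close()
        pool.join()
    print(len(labelled))
```

Passing file ids rather than word lists also keeps the data sent to each worker small; with the real corpus, `feats_for_file` would presumably call `movie_reviews.words(fileids=[fileid])` inside the worker.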