Python multiprocessing - text processing

Question

3

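I am trying to create a multiprocessing version of some text classification code. I found the full code here (along with other cool features), and I have attached it in full below.

I tried a few approaches. First I tried a lambda function, but it complained that it could not be pickled, so I tried a simplified version of the original code: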

  negids = movie_reviews.fileids('neg')
  posids = movie_reviews.fileids('pos')

  p = Pool(2)
  negfeats = []
  posfeats = []

  for f in negids:
      words = movie_reviews.words(fileids=[f])
      negfeats = p.map(featx, words) #not same form as below - using for debugging

  print len(negfeats)

Unfortunately, even this does not work. I get the following traceback:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division

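Any idea what I might be doing wrong? Should I be using pool.apply_async instead? (Using it alone does not seem to solve the problem either, but maybe I am on the wrong track.)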

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

- malangi

2 Answers

1

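Are you trying to parallelize classification, training, or both? Word counting and scoring can probably be parallelized fairly easily, but I am not sure about feature extraction and training. For classification, I would recommend execnet. I have had good results using it for parallel/distributed part-of-speech tagging.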

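The basic idea with execnet is that you train a single classifier once, then send it to each execnet node. Next, you distribute the files among the nodes, and each node classifies every file it is given. The results are then sent back to the master node. I have not tried pickling a classifier yet, so I am not sure it will work, but if a part-of-speech tagger can be pickled, I would think a classifier can be too.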
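The train-once, distribute-files, collect-results pattern described above can be sketched without execnet. The example below swaps in the standard library's multiprocessing pool purely for illustration; it is not execnet code. `KeywordClassifier`, `init_worker`, `classify_document` and the sample documents are all hypothetical stand-ins, and the sketch assumes Python 3 and a classifier that can be pickled.

```python
from multiprocessing import Pool


class KeywordClassifier:
    """Hypothetical stand-in for a trained, picklable classifier."""

    def __init__(self, negative_words):
        self.negative_words = set(negative_words)

    def classify(self, words):
        hits = sum(1 for w in words if w in self.negative_words)
        return "neg" if hits > 0 else "pos"


_classifier = None  # set once in each worker process


def init_worker(classifier):
    """Receive the trained classifier once, when the worker starts."""
    global _classifier
    _classifier = classifier


def classify_document(item):
    """Classify one (doc_id, words) pair with the worker's classifier."""
    doc_id, words = item
    return doc_id, _classifier.classify(words)


if __name__ == "__main__":
    trained = KeywordClassifier(["awful", "boring"])  # "train" once
    documents = [
        ("a.txt", ["an", "awful", "film"]),
        ("b.txt", ["a", "fine", "film"]),
    ]
    # The classifier is shipped to each worker once, not once per document.
    with Pool(2, initializer=init_worker, initargs=(trained,)) as pool:
        for doc_id, label in pool.map(classify_document, documents):
            print(doc_id, label)
```

Passing the classifier through the pool initializer means it is transferred once per worker, which matters if the pickled classifier is large.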

- Jacob

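I have just started experimenting with pickling, and the pickles get pretty heavy (around 100MB). I will see whether I can get multiprocessing to work somehow; otherwise execnet looks like an alternative. I doubt training can be parallelized easily, but as you say, the other small pieces should not be too hard... hopefully. By the way, thanks for the stuff on streamhacker, it is a treasure trove! - malangi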
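To judge how heavy an object is before sending it to worker processes, its pickled size can be measured directly. This is a minimal sketch assuming Python 3; `pickled_size_bytes` and the sample feature dictionary are hypothetical, and a real classifier would be passed in place of the dictionary.

```python
import pickle


def pickled_size_bytes(obj):
    """Return the number of bytes the object occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


if __name__ == "__main__":
    # Hypothetical feature dictionary standing in for a trained model.
    features = {"word%d" % i: True for i in range(10000)}
    print(pickled_size_bytes(features), "bytes")
```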

- Vin-G · Accepted Answer

Regarding your simplified version, are you using a different featx function from the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?

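The exception most likely happens inside featx, and multiprocessing merely re-raises it. It does not include the original traceback, though, which makes it less helpful.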
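One way to recover the lost stack is to wrap the worker function so that the traceback is formatted inside the worker process and returned as ordinary data. The following is a minimal sketch assuming Python 3; `failing_featx` is a hypothetical function that deliberately divides by zero, and `call_with_traceback` is a hypothetical helper name.

```python
import functools
import traceback
from multiprocessing import Pool


def failing_featx(word):
    # Hypothetical stand-in for a featx that divides by a zero count.
    return {word: 1.0 / (len(word) - len(word))}


def call_with_traceback(func, arg):
    """Run func(arg); on failure return the traceback text from the worker."""
    try:
        return ("ok", func(arg))
    except Exception:
        return ("error", traceback.format_exc())


if __name__ == "__main__":
    # partial of module-level functions can be pickled, unlike a lambda.
    worker = functools.partial(call_with_traceback, failing_featx)
    with Pool(2) as pool:
        results = pool.map(worker, ["plot", "two"])
    for status, payload in results:
        if status == "error":
            print(payload)
```

The printed text names the function and line where the error was raised, which the re-raised exception in the question's traceback does not.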

First try running it without pool.map() (i.e. negfeats = [featx(x) for x in words]), or add something to featx that you can use for debugging.

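If that still does not solve it, please post the whole script you are working on (already simplified, if possible) in the original question so that others can run it and give a more targeted answer. Note that the following snippet, adapted from your simplified version, actually works: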

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    # Bag-of-words features: map every word to True.
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats = []
    posfeats = []

    for f in negids:
        words = movie_reviews.words(fileids=[f])
        # Debugging form: negfeats is overwritten on every iteration.
        negfeats = p.map(featx, words)

    print(len(negfeats))
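Note that the loop above maps featx over the individual words of one file, so each call receives a single string, and negfeats is overwritten on every pass. To get one labeled feature dictionary per file, as the original evaluate_classifier does, one option is to map over file ids instead. This is a sketch assuming Python 3; `TOY_CORPUS` is a hypothetical in-memory stand-in for movie_reviews and `features_for_file` is a hypothetical helper.

```python
from multiprocessing import Pool

# Hypothetical in-memory stand-in for movie_reviews: file id -> words.
TOY_CORPUS = {
    "neg/one.txt": ["dull", "plot", "dull", "acting"],
    "neg/two.txt": ["boring", "film"],
}


def featx(words):
    """Bag-of-words features, as in the snippet above."""
    return {word: True for word in words}


def features_for_file(fileid):
    """Load one file's words inside the worker and label the features."""
    return featx(TOY_CORPUS[fileid]), "neg"


if __name__ == "__main__":
    with Pool(2) as pool:
        # One task per file; the worker function is module-level so that
        # it can be pickled, which a lambda cannot.
        negfeats = pool.map(features_for_file, sorted(TOY_CORPUS))
    print(len(negfeats))
```

With the real corpus, the lookup in `features_for_file` would be replaced by `movie_reviews.words(fileids=[fileid])`, so that only the short file id is sent to each worker.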