scikit-learn random forest classifier: probabilistic prediction vs. majority vote

Question

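In section 1.9.2.1 of the scikit-learn documentation (excerpted below), why does the random forest implementation differ from Breiman's original paper? As far as I know, Breiman chose the majority vote (mode) for aggregating classifiers and the average for regression (according to the paper by Liaw and Wiener, the maintainers of the original R code, cited below).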

1. Why does scikit-learn use probabilistic prediction instead of majority vote?

2. Is there any advantage to using probabilistic prediction?

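The section in question is:

In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

Source: Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18-22.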

- William

2 Answers

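This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf). Soft voting gives nearly identical results, but yields smoother probabilities. Note that if you are using fully developed trees, there is no difference.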
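
To make the last point concrete, here is a minimal sketch. The dataset, sizes, and seeds are arbitrary choices for illustration and do not come from Breiman's report. It assumes a reasonably recent scikit-learn and continuous features, so that fully grown trees end in pure leaves. Under that assumption each tree's predict_proba is one-hot, and averaging probabilities amounts to counting votes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; sizes and seeds are arbitrary.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, y_train, X_test = X[:300], y[:300], X[300:]

# max_depth=None and min_samples_leaf=1 (the defaults) grow each tree fully.
forest = RandomForestClassifier(n_estimators=25, max_depth=None,
                                min_samples_leaf=1, random_state=0)
forest.fit(X_train, y_train)

# Shape: (n_trees, n_samples, n_classes)
tree_probas = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# With pure leaves, every per-tree probability is 0 or 1.
is_one_hot = bool(np.all(np.isclose(tree_probas, 0) | np.isclose(tree_probas, 1)))

# Hard vote: take each tree's argmax, then count votes per class.
tree_votes = tree_probas.argmax(axis=2)  # (n_trees, n_samples)
vote_counts = np.stack([(tree_votes == c).sum(axis=0)
                        for c in range(len(forest.classes_))], axis=1)
hard_pred = forest.classes_[vote_counts.argmax(axis=1)]

# Soft vote: this is what forest.predict does.
soft_pred = forest.predict(X_test)

print("one-hot tree probabilities:", is_one_hot)
print("hard and soft votes agree:", np.array_equal(hard_pred, soft_pred))
```

With depth-limited trees (for example max_depth=3) the leaves are no longer pure, and the two schemes can start to disagree on borderline samples.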

- Arnaud Joly

- AN6U5 · Accepted Answer

This question has now been answered on Cross Validated.

The answer is reproduced here for reference:

Such questions are always best answered by looking at the code, if you're fluent in Python.

RandomForestClassifier.predict, at least in the current version 0.16.1, predicts the class with highest probability estimate, as given by predict_proba.

The documentation for predict_proba says:

The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.
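
A quick way to check that description against an installed scikit-learn is sketched below. The toy data and the max_depth=3 setting are my own choices, picked so that leaves are impure and the per-tree probabilities are fractional. It only uses the public estimators_, classes_, predict and predict_proba attributes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; shallow trees so that leaves contain a mix of classes.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X, y)

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([t.predict_proba(X) for t in forest.estimators_])

# The forest's probability should be the mean over trees.
mean_tree_proba = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X)
print("mean of trees matches predict_proba:",
      np.allclose(mean_tree_proba, forest_proba))

# predict should pick the class with the highest averaged probability.
argmax_pred = forest.classes_[forest_proba.argmax(axis=1)]
print("argmax matches predict:", np.array_equal(argmax_pred, forest.predict(X)))
```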

The difference from the original method is probably just so that predict gives predictions consistent with predict_proba. The result is sometimes called "soft voting", rather than the "hard" majority vote used in the original Breiman paper. I couldn't in quick searching find an appropriate comparison of the performance of the two methods, but they both seem fairly reasonable in this situation.
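
For intuition on when the two schemes can disagree, here is a small hand-made example. The numbers are invented for illustration and are not taken from any real forest: one tree is very confident in class "A", while two trees lean slightly towards "B".

```python
import numpy as np

classes = np.array(["A", "B"])

# Hypothetical per-tree probabilities for a single sample: rows are trees,
# columns are P(A) and P(B).
tree_probas = np.array([
    [0.9, 0.1],   # strongly A
    [0.4, 0.6],   # weakly B
    [0.4, 0.6],   # weakly B
])

# Soft voting: average the probabilities, then take the argmax.
mean_proba = tree_probas.mean(axis=0)
soft_choice = classes[mean_proba.argmax()]

# Hard voting: each tree casts one vote for its own argmax.
votes = tree_probas.argmax(axis=1)
vote_counts = np.bincount(votes, minlength=len(classes))
hard_choice = classes[vote_counts.argmax()]

print("mean probabilities:", mean_proba)   # about [0.567, 0.433]
print("soft vote:", soft_choice)           # A
print("hard vote:", hard_choice)           # B
```

Soft voting lets the confident tree outweigh the two lukewarm ones, while hard voting only counts heads.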

The predict documentation is at best quite misleading; I've submitted a pull request to fix it.

If you want to do majority vote prediction instead, here's a function to do it. Call it like predict_majvote(clf, X) rather than clf.predict(X). (Based on predict_proba; only lightly tested, but I think it should work.)
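import numpy as np
from scipy.stats import mode
# Private helpers and module paths as of scikit-learn 0.16;
# they may have moved or been removed in other releases.
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed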
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted

def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    forest : RandomForestClassifier
        A fitted forest.
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

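    # Reduce: per-sample mode over the tree axis. The trees predict encoded
    # class indices as floats, so cast to int before indexing into classes_.
    # The reshape drops the leading length-1 axis that some scipy versions keep.
    modes, counts = mode(all_preds, axis=0)
    modes = np.asarray(modes).reshape(all_preds[0].shape).astype(int)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes, axis=0)
    else:
        # For multi-output problems, classes_ is a list with one array per output.
        n_samples = modes.shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_[0].dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds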
On the dumb synthetic case I tried, predictions agreed with the predict method every time.
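
The function above relies on private helpers from scikit-learn 0.16. As a rough alternative, here is a sketch that uses only the public estimators_ and classes_ attributes. The name predict_majority_vote is my own, it handles single-output problems only, and it relies on an implementation detail that I believe holds but that is not a documented guarantee: the sub-trees are fit on labels encoded as 0..n_classes-1, so their predictions are indices into forest.classes_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def predict_majority_vote(forest, X):
    """Hard-voting prediction for a fitted, single-output forest (sketch)."""
    n_classes = len(forest.classes_)
    # Assumption: each tree returns encoded class indices (as floats).
    votes = np.stack([tree.predict(X) for tree in forest.estimators_])
    votes = votes.astype(int)  # (n_trees, n_samples)
    # counts[i, c] = number of trees voting for class index c on sample i.
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)],
                      axis=1)
    # Ties go to the class that comes first in forest.classes_.
    return forest.classes_[counts.argmax(axis=1)]


# Toy data with string labels, to exercise the index-to-label mapping.
X, y_idx = make_classification(n_samples=300, n_features=8, n_informative=4,
                               n_classes=3, random_state=0)
y = np.array(["a", "b", "c"])[y_idx]

# Shallow trees, so that soft and hard voting have a chance to differ.
forest = RandomForestClassifier(n_estimators=15, max_depth=2, random_state=0)
forest.fit(X, y)

hard = predict_majority_vote(forest, X)
soft = forest.predict(X)
agreement = float(np.mean(hard == soft))
print("fraction of samples where hard and soft voting agree:", agreement)
```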