sklearn random forest: .oob_score_ too low?

Question

scikit-learn, classification, random-forest, cross-validation

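I was looking for applications of random forests and found the following knowledge competition on Kaggle: https://www.kaggle.com/c/forest-cover-type-prediction.

Following the advice here, I used sklearn to build a random forest with 500 trees. The .oob_score_ was about 2%, but the score on the holdout set was about 75%. There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross-validated.

Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross-validated scores? I would expect them to be similar.

There is a similar question here: https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests

Edit: I think it might be a bug, too.

The code is given in the original post at the second link. The only change is that you have to set oob_score = True when you build the random forest.

I did not save the cross-validation tests I ran, but I can redo them if anyone needs to see them.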

- hahdawg

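This question appears to be off-topic because it is about statistics rather than programming. - Fred Foo

Hm, this does sound a bit like a bug :-/ Can you post your code somewhere? - Andreas Mueller

1 Answer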

- user3666197 · Accepted Answer

Q: Can anyone explain the discrepancy?

A: The sklearn.ensemble.RandomForestClassifier object and its observed .oob_score_ attribute value are not a bug-related issue.

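First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that the typical approaches, including cross-validation, do not work the same way as they do for other AI/ML learners.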

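RandomForest's "inner" logic relies heavily on a random process. The samples (DataSET X) with known y == { labels (for a Classifier) | targets (for a Regressor) } are split throughout the forest generation: each tree is bootstrapped by randomly dividing the DataSET into a part the tree can see and a part the tree will not see (the latter forms the inner oob subset).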
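To make the "part the tree will not see" concrete, here is a minimal sketch of a single bootstrap draw using plain NumPy. It is not scikit-learn's internal code; the helper name `oob_fraction` and the sample size are assumptions chosen only for illustration.

```python
import numpy as np


def oob_fraction(n_samples, seed=0):
    """Draw one bootstrap sample and return the share of rows it never picked."""
    rng = np.random.default_rng(seed)
    # Sampling with replacement: some rows are drawn several times, others never.
    drawn = rng.integers(0, n_samples, size=n_samples)
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[drawn] = True
    return 1.0 - in_bag.mean()


frac = oob_fraction(100_000)
# The expected share is (1 - 1/n)^n, which tends to exp(-1), roughly 0.368.
print(round(frac, 3))
```

So each tree leaves out roughly a third of the rows, and those rows are what the out-of-bag estimate for that tree is computed on.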

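Besides other effects on sensitivity to overfitting and the like, the RandomForest ensemble does not need to be cross-validated, because by design it does not overfit. Many papers, as well as Breiman's (Berkeley) empirical proofs, support this statement, as they provide evidence that a cross-validated predictor will have the same .oob_score_.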
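One way to check the "same score" claim on your own data is to put the two numbers side by side. The sketch below does that on a synthetic dataset from `make_classification`; the dataset, the tree count and the fold count are assumptions for illustration, not the Kaggle data from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Out-of-bag accuracy, computed during training on the whole dataset.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
oob_acc = forest.oob_score_

# 5-fold cross-validated accuracy of an identically configured forest.
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
).mean()

print(f"OOB accuracy: {oob_acc:.3f}   5-fold CV accuracy: {cv_acc:.3f}")
```

If the two numbers are far apart on real data, as in the question, that is a signal worth investigating rather than something this comparison explains by itself.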

import sklearn.ensemble

aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor(
    n_estimators             = 10,               # The number of trees in the forest.
    criterion                = 'squared_error',  # { Regressor: 'squared_error' (named 'mse' in older releases) | Classifier: 'gini' }
    max_depth                = None,
    min_samples_split        = 2,
    min_samples_leaf         = 1,
    min_weight_fraction_leaf = 0.0,
    max_features             = 1.0,              # use all features (spelled 'auto' for the Regressor in older releases)
    max_leaf_nodes           = None,
    bootstrap                = True,
    oob_score                = False,            # SET True to get inner-CrossValidation-alike .oob_score_ attribute calculated right during Training-phase on the whole DataSET
    n_jobs                   = 1,                # { 1 | n-cores | -1 == all-cores }
    random_state             = None,
    verbose                  = 0,
    warm_start               = False,
)

# Attributes available after .fit( X, y ):
#   aRF_PREDICTOR.estimators_            list of <DecisionTreeRegressor>  The collection of fitted sub-estimators.
#   aRF_PREDICTOR.feature_importances_   array of shape = [n_features]    The feature importances (the higher, the more important the feature).
#   aRF_PREDICTOR.oob_score_             float                            Score of the training dataset obtained using an out-of-bag estimate (needs oob_score = True).
#   aRF_PREDICTOR.oob_prediction_        array of shape = [n_samples]     Prediction computed with out-of-bag estimate on the training set (needs oob_score = True).

# Methods ( optional arguments in square brackets ):
#   aRF_PREDICTOR.apply(         X )                      Apply trees in the forest to X, return leaf indices.
#   aRF_PREDICTOR.fit(           X, y[, sample_weight] )  Build a forest of trees from the training set (X, y).
#   aRF_PREDICTOR.fit_transform( X[, y] )                 Fit to data, then transform it (older releases only).
#   aRF_PREDICTOR.get_params(          [deep] )           Get parameters for this estimator.
#   aRF_PREDICTOR.predict(       X )                      Predict regression target for X.
#   aRF_PREDICTOR.score(         X, y[, sample_weight] )  Returns the coefficient of determination R^2 of the prediction.
#   aRF_PREDICTOR.set_params(          **params )         Set the parameters of this estimator.
#   aRF_PREDICTOR.transform(     X[, threshold] )         Reduce X to its most important features (older releases only).

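Note that the default values do not serve best, and certainly not under all circumstances. Look at the problem domain first and propose a reasonable set of ensemble parameters before moving on.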

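Q: What is a good .oob_score_?

A: .oob_score_ is random! ... Yes, it must be random.

While this may sound provocative, do not lose hope. A RandomForest ensemble is a great tool. Some problems may come with categorical values in the features (DataSET X), but the cost of processing the ensemble is still adequate once you do not have to struggle with either bias or overfitting. That's great, isn't it?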

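To be able to reproduce the same results on subsequent re-runs, it is recommended practice to set numpy.random and .set_params(random_state=...) to a known state before the random process (embedded in every bootstrap of the RandomForest ensemble) starts. With that in place, one can observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that comes from truly improved predictive power, introduced by more ensemble members (n_estimators) or less constrained tree construction (max_depth, max_leaf_nodes, etc.), and not merely from "better luck" in how the random process happened to split the DataSET.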
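A small sketch of what pinning the random state buys you: two forests built with the same `random_state` on the same data should report the same .oob_score_. The synthetic dataset, the helper name `fit_oob` and the seed value 42 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def fit_oob(seed):
    """Fit a forest with a fixed random_state and return its out-of-bag score."""
    forest = RandomForestClassifier(n_estimators=50, oob_score=True,
                                    random_state=seed)
    return forest.fit(X, y).oob_score_


first = fit_oob(42)
second = fit_oob(42)
print(first, second, first == second)
```

With the seed held fixed, any change in .oob_score_ between runs can be attributed to the parameters you changed rather than to a different bootstrap draw.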

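Moving closer to better solutions typically involves more trees in the ensemble (RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for making good decisions on highly complex DataSETs). Numbers above 2000 are not uncommon. One may iterate over a range of ensemble sizes (with the random process kept under state control) to demonstrate the ensemble's "improvement".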
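The iteration over ensemble sizes could look roughly like this. It is a sketch on synthetic data; the particular sizes, the dataset and the name `oob_by_size` are assumptions, and the sizes are kept small here only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

oob_by_size = {}
for n_trees in (50, 100, 200, 400):
    # Same random_state every time, so only the ensemble size varies.
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=0)
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"n_estimators = {n_trees:4d}   oob_score_ = {score:.3f}")
```

The sizes start at 50 rather than 10 because with very few trees some rows may never end up out-of-bag, which makes the estimate unreliable.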

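If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is 1% - 3% better than a random guess (taking 0.50 as chance level, i.e. a balanced two-class problem).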
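"Better than a random guess" depends on the number of classes: with seven balanced classes, as in the question, chance level is about 1/7 rather than 0.5. A hedged way to get a reference point is to score a trivial baseline next to the forest. The sketch below uses `DummyClassifier` on synthetic seven-class data; the dataset and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic, roughly balanced seven-class data (an assumption, not the Kaggle set).
X, y = make_classification(n_samples=700, n_features=20, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y).score(X, y)

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)

print(f"baseline accuracy: {baseline:.3f}   oob_score_: {forest.oob_score_:.3f}")
```

An .oob_score_ well below such a baseline, like the 2% reported in the question, is worse than guessing and points at something other than a weak model.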

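Only after you have made your ensemble-based predictor somewhat better should you move on to additional tricks such as feature engineering.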

aRF_PREDICTOR.oob_score_    Out[79]: 0.638801  # n_estimators =   10
aRF_PREDICTOR.oob_score_    Out[89]: 0.789612  # n_estimators =  100