The XGBoost package and random forest regression

Question


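The xgboost package can build a random forest. (Strictly speaking, it draws one random subset of columns per tree and picks every split variable in that tree from it, instead of drawing a new subset at each node as the classical algorithm does, but that can be tolerated.) For regression, however, it looks as if only one tree from the forest (perhaps the last one built) is used.

To check this, consider a standard toy example.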

library(xgboost)
library(randomForest)

data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# xgboost "random forest": one boosting round that grows 100 parallel trees
bst = xgb.train(params = list(subsample = 0.8,
                              colsample_bytree = 0.5,
                              num_parallel_tree = 100,
                              max_depth = 12),
                data = dtrain,
                nrounds = 1,
                verbose = 2)

# sum of squared errors on the training data
answer1 = predict(bst, dtrain)
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)

# reference forest from the randomForest package
forest = randomForest(x = as.matrix(agaricus.train$data), y = agaricus.train$label, ntree = 50)

answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)

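Yes, of course, the default xgboost random forest uses plain MSE, not a Gini score, as its splitting criterion; that is easy to change. And yes, validating on the training data like this is not correct. Neither point affects the main problem: whatever parameter sets I try, the results are surprisingly bad compared with the randomForest implementation. The same holds for other data sets.

Can anybody provide a hint about this strange behaviour? For classification tasks the algorithm does work as expected.

Well, all trees are grown and all of them are used to make a prediction. You can check that with the 'ntree_limit' parameter of the 'predict' function.

The main question remains: is the specific form of the Random Forest algorithm produced by the xgboost package valid?

Cross-validation, parameter tuning and other such housekeeping are beside the point here; anyone can add the necessary corrections to the code and see what happens.

You can specify the 'objective' option like this: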

mse = function(predict, dtrain)
{
  real = getinfo(dtrain, 'label')
  return(list(grad = 2 * (predict - real),
              hess = rep(2, length(real))))
}

This makes xgboost use MSE when choosing the split variable. Even then, the results are surprisingly bad compared with those of randomForest.
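For reference, the 'grad' and 'hess' values in the function above are the first and second derivatives of a per-observation squared-error loss with respect to the prediction. The symbols $\hat{y}_i$ (prediction) and $y_i$ (label) are notation introduced here, not taken from the post:

$$
\ell(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2, \qquad
\frac{\partial \ell}{\partial \hat{y}_i} = 2\,(\hat{y}_i - y_i), \qquad
\frac{\partial^2 \ell}{\partial \hat{y}_i^2} = 2 .
$$

To my understanding, the built-in squared-error objective corresponds to $\tfrac{1}{2}(\hat{y}_i - y_i)^2$, which gives gradient $\hat{y}_i - y_i$ and Hessian $1$. If so, the custom version differs only by a constant factor, which would matter mainly in how the Hessian sums compare with regularization settings.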

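Maybe the issue is of an academic nature and concerns the way the random subset of features for a split is chosen. The classical implementation draws a separate feature subset for every split (its size is set by 'mtry' in the randomForest package), whereas the xgboost implementation draws one feature subset per tree (its size is set by 'colsample_bytree').

So this fine difference appears to matter a great deal for some types of data sets. Interesting, indeed.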
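The difference between the two sampling schemes can be sketched in base R without either package. This is only an illustration of the bookkeeping, not of how either library is implemented; `n_features`, `n_splits` and `k` are hypothetical values chosen for the example.

```r
# Sketch: which features are offered to the splits of ONE tree under
# per-tree sampling (colsample_bytree-like) vs per-node sampling (mtry-like).
set.seed(1)
n_features <- 10   # hypothetical number of columns
n_splits   <- 15   # hypothetical number of internal nodes in the tree
k          <- 3    # features offered at each split

# Per-tree sampling: one subset is drawn once and reused at every split.
per_tree_subset     <- sample(n_features, k)
per_tree_candidates <- replicate(n_splits, per_tree_subset)

# Per-node sampling: a fresh subset is drawn at every split.
per_node_candidates <- replicate(n_splits, sample(n_features, k))

# Number of distinct features the tree could ever split on.
n_distinct_per_tree <- length(unique(as.vector(per_tree_candidates)))
n_distinct_per_node <- length(unique(as.vector(per_node_candidates)))

cat("per-tree sampling:", n_distinct_per_tree, "distinct features\n")
cat("per-node sampling:", n_distinct_per_node, "distinct features\n")
```

Under per-tree sampling a single tree can never see more than `k` features, so an interaction between a sampled and an unsampled feature is invisible to it. Under per-node sampling the same tree can reach many more features across its splits.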

- mv_

What is your question? - eliasah

As I said, I am wondering why I get such results. Is there a bug in the package, or am I just using its functions incorrectly? - mv_

1 Answer

- Soren Havelund Welling · Accepted Answer

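xgboost (random forest style) does use more than one tree to predict. But there are many other differences to explore.

I am not very familiar with xgboost myself, but I was curious, so I wrote the code below to visualize the trees. You can run it yourself to verify this or to explore other differences.

Your data set is a classification problem, as the labels are either 0 or 1. I prefer to switch to a simple regression problem to visualize what xgboost does.

True model: $y = x_1 \cdot x_2 + \text{noise}$

If you train a single tree or multiple trees with the code examples below, you will see that the learned model structure does contain more than one tree. You cannot tell from prediction accuracy alone how many trees were trained.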
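One direct way to check the number of trees, without relying on prediction accuracy, is to dump the fitted model into a table and count the distinct tree indices. The following is a minimal sketch, assuming an xgboost version in which `xgb.train()` accepts a `params` list and `xgb.model.dt.tree()` returns a table with a `Tree` column. The data and parameter values are made up for the illustration.

```r
library(xgboost)

# Small synthetic regression problem with the same structure as the true model above.
set.seed(1)
X <- matrix(rnorm(2000), ncol = 2)
y <- X[, 1] * X[, 2] + rnorm(1000) * 0.5
dtrain <- xgb.DMatrix(X, label = y)

# One boosting round that grows 20 trees in parallel (random forest style).
bst <- xgb.train(
  params = list(max_depth = 6, num_parallel_tree = 20,
                subsample = 0.5, colsample_bytree = 1),
  data = dtrain, nrounds = 1, verbose = 0
)

# One row per node; the Tree column holds the index of the tree the node belongs to.
trees   <- xgb.model.dt.tree(model = bst)
n_trees <- length(unique(trees$Tree))
cat("trees stored in the model:", n_trees, "\n")
```

If the count matches `num_parallel_tree * nrounds`, the forest was built in full, and any poor fit has to come from somewhere else.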

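The predictions may differ simply because the implementations differ. None of the roughly five RF implementations I know are exactly alike, and this xgboost (RF style) is the closest of the "cousins".

I observe that colsample_bytree is not equivalent to mtry, because the former uses the same subset of variables/columns for the entire tree. My regression problem consists of one large interaction only, which cannot be learned by trees that use either x1 or x2 alone. So in this case colsample_bytree must be set to 1, so that every tree can use both variables. A regular RF could model this problem with mtry = 1, because each node would use either X1 or X2.

I also see that your randomForest predictions are not out-of-bag cross-validated. If you want to draw any conclusions from predictions you must cross-validate, especially for fully grown trees.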
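Why a tree restricted to one column cannot learn a pure interaction can be illustrated in base R. The sketch below does not fit any trees. As a stand-in, it uses piecewise-constant fits (bin means over deciles), which is an assumption made for this illustration and not what xgboost or randomForest do internally.

```r
# Sketch: a piecewise-constant fit on x1 alone vs on the (x1, x2) grid
# for the pure-interaction model y = x1 * x2 + noise.
set.seed(42)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- x1 * x2 + rnorm(n) * 0.5

# Decile bins for each variable.
bins1 <- cut(x1, quantile(x1, 0:10 / 10), include.lowest = TRUE)
bins2 <- cut(x2, quantile(x2, 0:10 / 10), include.lowest = TRUE)

# Fitted value = mean of y within the bin (x1 only) or within the grid cell (x1 and x2).
fit_one  <- ave(y, bins1)
fit_both <- ave(y, bins1, bins2)

# In-sample R^2 of each fit.
r2 <- function(y, fit) 1 - sum((y - fit)^2) / sum((y - mean(y))^2)
r2_one  <- r2(y, fit_one)
r2_both <- r2(y, fit_both)

cat("R^2 using x1 only:   ", round(r2_one, 3), "\n")
cat("R^2 using x1 and x2: ", round(r2_both, 3), "\n")
```

Because the conditional mean of y given x1 alone is zero, no partition of x1 can explain the response. Averaging many such one-variable fits does not help either, which is consistent with the poor fit seen when colsample_bytree = .5 on two columns.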

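Note: you need to patch the function vec.plot, because it does not support xgboost out of the box: unlike some other frameworks, xgboost does not accept a data.frame as valid input. The instructions in the code should be clear.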

library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2,rnorm(5000)))
Data$y = Data$X1*Data$X2 + rnorm(5000)*.5
gradientByTarget =fcol(Data,3)
plot3d(Data,col=gradientByTarget) #true data structure

fix(vec.plot) #change these two lines in the function, as xgboost does not support data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))

#1 single deep tree
xgb.model =  xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
                     nrounds=1,params = list(max.depth=250))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget,grid=200)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget)
#clearly just one tree

#100 trees (gbm boosting)
xgb.model =  xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
                     nrounds=100,params = list(max.depth=16,eta=.5,subsample=.6))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) 
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget) ##predictions are not OOB cross-validated!


#20 deep trees (bagging), columns sampled once per tree
xgb.model =  xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
                     nrounds=1,params = list(max.depth=250,
                     num_parallel_tree=20,colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2]))) #terrible fit!!
#problem, colsample_bytree is NOT mtry as columns are only sampled once
# (this could be raised as an issue on their github page, that this does not mimic RF)


#200 deep trees (bagging), no column limitation
xgb.model =  xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
                     nrounds=1,params = list(max.depth=500,
                     num_parallel_tree=200,colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])))
#voila, the model can fit the data