Hyperparameter tuning using the pure ranger package

Question

12

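I love the speed of the ranger package for random forest model creation, but I can't see how to tune mtry or the number of trees. I realize I can do this via caret's train() syntax, but I prefer the speed increase that comes from using pure ranger.

Here is my example of basic model creation in ranger (which works great):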

library(ranger)
data(iris)

fit.rf = ranger(
  Species ~ .,
  training_data = iris,
  num.trees = 200
)

print(fit.rf)

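Looking at the official documentation for tuning options, it seems the csrf() function may provide the ability to tune hyperparameters, but I can't get the syntax right: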

library(ranger)
data(iris)

fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  params1 = list(num.trees = 25, mtry=4),
  params2 = list(num.trees = 50, mtry=4)
)

print(fit.rf.tune)

This results in:

Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : 
  unused argument (training_data = iris)

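I would prefer to tune with the regular (non-csrf) random forest algorithm that ranger provides. Any ideas for a hyperparameter tuning solution for either path in ranger? Thanks!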

- Levi Thatcher

5 Answers

6

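I think there are at least two errors:

First, the function ranger does not have a parameter called training_data. Your error message, Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris), refers to exactly that. You can confirm it by looking at ?ranger or args(ranger).

Second, the function csrf does take training_data and test_data as inputs, but neither argument has a default value, so both must be supplied. The following runs without problems: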

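library(ranger)

# ranger() takes the data set through the `data` argument
fit.rf = ranger(
  Species ~ .,
  data = iris,
  num.trees = 200
)

# csrf() needs both training_data and test_data
fit.rf.tune = csrf(
  Species ~ .,
  training_data = iris,
  test_data = iris,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)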

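Here I have simply supplied iris as both the training and the test data set. You would obviously not want to do that in a real application. Also note that ranger itself takes num.trees and mtry as inputs, so you could try tuning them there.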
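
As a rough sketch of what a held-out split could look like instead: the 100/50 split, the seed, and the accuracy check below are assumptions made purely for illustration and are not part of the original answer. It relies on csrf being documented as returning the predictions for the test data.

```r
library(ranger)

# Assumption: a simple random 100/50 split of iris, chosen only for illustration
set.seed(1)
train_idx  <- sample(nrow(iris), 100)
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

# csrf() returns the predictions for the rows of test_data
pred <- csrf(
  Species ~ .,
  training_data = train_data,
  test_data = test_data,
  params1 = list(num.trees = 25, mtry = 4),
  params2 = list(num.trees = 50, mtry = 4)
)

# Share of held-out rows that were predicted correctly
accuracy <- mean(as.character(pred) == as.character(test_data$Species))
print(accuracy)
```

Because csrf grows one weighted forest per test observation, this gets slow quickly on larger test sets, which is one more reason to tune the regular ranger function instead.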

- coffeinjunky

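Great info, thanks! To your knowledge, is there no non-CSRF route to hyperparameter tuning in ranger? Also, Zheyuan, I originally asked whether a non-CSRF option was available (not just a fix for the documented CSRF implementation). - Levi Thatcher

Very generous, thanks all. Just a heads up, coffeinjunky: although the error message I posted shows the ranger function, I was actually using the csrf function (not sure if you want to edit your answer). I will email Marvin Wright (the maintainer) to let him know. Thanks again! - Levi Thatcher

Also, coffeinjunky, if you are editing, could you add an example of tuning with the params1 and params2 syntax using the ranger function? Thanks! - Levi Thatcher

Just put num.trees = 5 or some other number, or mtry = 3 or some other number, into the function arguments, e.g. ranger(Species ~ ., data = iris, num.trees = 200, mtry = 3). Note that mtry cannot exceed the number of predictors, which is 4 for iris. - coffeinjunky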

4

Note that mlr disables ranger's internal parallelization by default. Set the hyperparameter num.threads to the number of available cores to speed mlr up:

learner <- makeLearner("classif.ranger", num.threads = 4)

Alternatively, start a parallel backend (these functions come from the parallelMap package) via:

parallelStartMulticore(4) # linux/osx
parallelStartSocket(4)    # windows

Do this before calling tuneParams so that the tuning itself runs in parallel.
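
Put together, the second option might look roughly like the sketch below. The task, the mtry grid, the two workers, and the choice of the socket backend are illustrative assumptions and not part of the original answer.

```r
library(mlr)
library(parallelMap)

# Assumed setup for illustration: iris task, 5-fold CV, small discrete grid for mtry
task    <- makeClassifTask(id = "iris", data = iris, target = "Species")
learner <- makeLearner("classif.ranger", num.trees = 200)
rdesc   <- makeResampleDesc("CV", iters = 5)
ps      <- makeParamSet(makeDiscreteParam("mtry", values = 1:4))

# Start the backend first, then tune, then shut the workers down again
parallelStartSocket(2)
res <- tuneParams(learner, task, rdesc, par.set = ps,
                  control = makeTuneControlGrid())
parallelStop()

print(res$x)  # best hyperparameter setting found
```

Use either num.threads or a parallel backend rather than both, otherwise the two levels of parallelism compete for the same cores.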

- Michel

4

Another way to tune the model is to build the grid by hand. There may be better ways to train the model, but this is one more option.

hyper_grid <- expand.grid(
  mtry       = 1:4,
  node_size  = 1:3,
  num.trees = seq(50,500,50),
  OOB_RMSE   = 0
)

system.time(
  for(i in 1:nrow(hyper_grid)) {
    # train model
    rf <- ranger(
      formula        = Species ~ .,
      data           = iris,
      num.trees      = hyper_grid$num.trees[i],
      mtry           = hyper_grid$mtry[i],
      min.node.size  = hyper_grid$node_size[i],
      importance = 'impurity')
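    # add OOB error to grid (for classification, prediction.error is the OOB
    # misclassification rate, so this stores its square root, not a true RMSE)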
    hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
  })
user  system elapsed 
3.17    0.19    1.36

nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
     mtry node_size num.trees     OOB_RMSE
6     2         2        50 0.1825741858
23    3         3       100 0.1825741858
3     3         1        50 0.2000000000
11    3         3        50 0.2000000000
14    2         1       100 0.2000000000

# fit best model
rf.model <- ranger(
  Species ~ .,
  data          = iris,
  num.trees     = hyper_grid$num.trees[position],
  importance    = 'impurity',
  probability   = FALSE,
  min.node.size = hyper_grid$node_size[position],
  mtry          = hyper_grid$mtry[position]
)
rf.model
Ranger result

Call:
 ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position]) 

    Type:                             Classification 
Number of trees:                  50 
Sample size:                      150 
Number of independent variables:  4 
Mtry:                             2 
Target node size:                 2 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error:             5.33 %

I hope this helps.

- Rafael Díaz

0

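There is also the tuneRanger R package, which is designed specifically for tuning ranger. It comes with predefined tuning parameters and hyperparameter spaces, and tunes intelligently by using the out-of-bag observations.

Note that random forest is not an algorithm where tuning makes a big difference, but it can usually improve performance a little.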
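
A minimal sketch of how a tuneRanger call might look, modeled on the package's README example. The iris task, the Brier score measure, the tree count, and the thread count are assumptions for illustration; tuneRanger works on mlr tasks.

```r
library(mlr)
library(tuneRanger)

# tuneRanger expects an mlr task
task <- makeClassifTask(data = iris, target = "Species")

# Tunes mtry, min.node.size and sample.fraction by default,
# evaluating each setting on the out-of-bag predictions
res <- tuneRanger(task,
                  measure     = list(multiclass.brier),
                  num.trees   = 500,
                  num.threads = 2,
                  iters       = 70)

print(res$recommended.pars)  # recommended hyperparameter values
print(res$model)             # model refit with the recommended values
```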

- PhilippPro

Content provided by Stack Overflow.

- Levi Thatcher · Accepted Answer

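To answer my own (unclear) question: apparently ranger has no built-in CV/GridSearch functionality. However, here is how to do hyperparameter tuning with ranger (via a grid search) outside of caret. Thanks to Marvin Wright (the maintainer of ranger) for the code. It turns out my caret CV runs were slow because I was using the formula interface, which should be avoided.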

ptm <- proc.time()
library(ranger)
library(mlr)

# Define task and learner
task <- makeClassifTask(id = "iris",
                        data = iris,
                        target = "Species")

learner <- makeLearner("classif.ranger")

# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4),
                   makeDiscreteParam("num.trees", 200))

# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
           control = makeTuneControlGrid())

# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, task)  # use the task defined above

print(m)
print(proc.time() - ptm) # ~6 seconds

For those who are curious, the caret equivalent is:

ptm <- proc.time()
library(caret)
data(iris)

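# Note: newer caret versions also expect splitrule and min.node.size
# columns in the grid for method = 'ranger'
grid <- expand.grid(mtry = c(3, 4))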

fitControl <- trainControl(method = "CV",
                           number = 5,
                           verboseIter = TRUE)

fit = train(
  x = iris[ , names(iris) != 'Species'],
  y = iris[ , names(iris) == 'Species'],
  method = 'ranger',
  num.trees = 200,
  tuneGrid = grid,
  trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds

Overall, caret is the fastest way to run a grid search with ranger, provided the non-formula interface is used.
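
For completeness, since ranger has no built-in cross-validation, the same 5-fold search over mtry can also be written by hand in pure ranger. The sketch below is only one way to do it: the fold assignment, the seed, and the use of dependent.variable.name (ranger's alternative to the formula interface) are choices made for illustration.

```r
library(ranger)

set.seed(1)
k         <- 5
mtry_grid <- c(3, 4)

# Assign every row to one of k folds at random
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Mean misclassification rate over the k folds for each mtry value
cv_error <- sapply(mtry_grid, function(m) {
  fold_error <- sapply(seq_len(k), function(f) {
    train_data <- iris[fold != f, ]
    test_data  <- iris[fold == f, ]
    fit <- ranger(dependent.variable.name = "Species",
                  data = train_data, num.trees = 200, mtry = m)
    pred <- predict(fit, data = test_data)$predictions
    mean(as.character(pred) != as.character(test_data$Species))
  })
  mean(fold_error)
})

results   <- data.frame(mtry = mtry_grid, cv_error = cv_error)
best_mtry <- results$mtry[which.min(results$cv_error)]

# Refit on the full data set with the best mtry
final_fit <- ranger(dependent.variable.name = "Species",
                    data = iris, num.trees = 200, mtry = best_mtry)
print(results)
```

On a data set this small the difference between mtry values is mostly noise, so treat the selected value as illustrative.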