使用caret实现完全可复制的并行模型

Question

使用caret实现完全可复制的并行模型

48

当我在caret中运行两个随机森林时，如果我设置一个随机种子，我会得到完全相同的结果：

当我在caret中运行两个随机森林时，如果我设置一个随机种子，我得到的结果是完全一样的。

library(caret)
library(doParallel)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE

但是，如果我注册一个并行的后端来加速建模，每次运行模型时我会得到不同的结果：

cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

stopCluster(cl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"

有没有办法解决这个问题？有一个建议是使用doRNG包，但是train使用了嵌套循环，当前不支持：

library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter",  : 
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.

更新：我认为可以使用doSNOW和clusterSetupRNG来解决这个问题，但我没能完全搞定。

set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)

myControl <- trainControl(method='cv', index=createFolds(iris$Species))

clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"

stopCluster(cl)

foreach有什么特殊之处，为什么它不使用我在集群上初始化的种子？对象a和b是相同的，那么为什么model1和model2不是呢？

- Zach

也许这个问题会提供一些有用的信息...？链接 - joran

它确实提供了有用的信息。不幸的是，使用snow需要修改caret源代码，而使用doRNG则失败了。 - Zach

现在可以使用 library(doMC) - 参见 http://caret.r-forge.r-project.org/parallel.html - Stéphane Laurent

3个回答

8

所以Caret使用foreach包进行并行化。很可能在每次迭代中设置种子的方法，但我们需要在“train”中设置更多选项。

 。
另外，您可以创建一个自定义建模函数，模仿随机森林的内部函数，并自己设置种子。
Max

- topepo

0

您使用的caret版本是哪个？

@BBrill的答案是正确的。然而，自从v6.0.64（2016年1月15日）以来，caret已经考虑到了这个问题。您可以提供自定义的trControl$seeds，但不一定需要。如果trControl$seeds为NULL，caret将自动生成这些内容，这样即使进行并行训练也能保证可重复性。

这种行为可以在https://github.com/topepo/caret/commit/9f375a1704e413d0806b73ab8891c7fadc39081c中找到。

Pull request: https://github.com/topepo/caret/pull/353

相关代码片段：

    if(is.null(trControl$seeds) || all(is.na(trControl$seeds)))  {
      seeds <- sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + 1L)
      seeds <- lapply(seq(from = 1L, to = length(seeds), by = nrow(trainInfo$loop)),
                      function(x) { seeds[x:(x+nrow(trainInfo$loop)-1L)] })
      seeds[[num_rs + 1L]] <- seeds[[num_rs + 1L]][1L]
      trControl$seeds <- seeds
    } else {
      (... omitted ...)
    }

如需更多详细信息，您可以

- Haizi

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- BBrill · Accepted Answer

使用caret软件包中的seeds参数调用train control是一种在并行模式下运行完全可重复模型的简单方法。这里解决了上述问题，有关更多信息，请参阅trainControl帮助页面。

library(doParallel); library(caret)

#create a list of seed, here change the seed for each resampling
set.seed(123)

#length is = (n_repeats*nresampling)+1
seeds <- vector(mode = "list", length = 11)

#(3 is the number of tuning parameter, mtry for rf, here equal to ncol(iris)-2)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 3)

#for the last model
seeds[[11]]<-sample.int(1000, 1)

 #control list
 myControl <- trainControl(method='cv', seeds=seeds, index=createFolds(iris$Species))

 #run model in parallel
 cl <- makeCluster(detectCores())
 registerDoParallel(cl)
 model1 <- train(Species~., iris, method='rf', trControl=myControl)

 model2 <- train(Species~., iris, method='rf', trControl=myControl)
 stopCluster(cl)

 #compare
 all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE