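Here is an example of parallelizing the gafs() function with the doParallel package and changing some other parameters to speed it up. Where possible, I include run times.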
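The original code uses cross-validation (method = "cv") rather than repeated cross-validation (method = "repeatedcv"), so I believe the repeats = 2 argument is ignored. I did not include that option in the parallelized example.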
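To make the cv versus repeatedcv distinction concrete, here is a minimal sketch of the two control objects. It assumes the caret package is installed. The object names ctrl_cv and ctrl_rep are only illustrative, and the comment about repeats being ignored restates the belief above rather than a verified fact.

```r
library(caret)

# Plain k-fold CV: `number` sets the fold count. As far as I can tell,
# `repeats` is not used with this method, so it is left out here.
ctrl_cv <- gafsControl(functions = caretGA,
                       method = "cv",
                       number = 10)

# Repeated k-fold CV: this is the method where `repeats` applies.
ctrl_rep <- gafsControl(functions = caretGA,
                        method = "repeatedcv",
                        number = 10,
                        repeats = 2)

ctrl_cv$method
ctrl_rep$repeats
```

Either object can then be passed as gafsControl = ctrl_cv or gafsControl = ctrl_rep in the gafs() call. Keep in mind that repeating the external resampling multiplies the run time accordingly.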
First, the original code without any modifications or parallelization:
> library(caret)
> data(iris)
> set.seed(1)
> st.01 <- system.time(results.01 <- gafs(iris[,1:4], iris[,5],
      iters = 2,
      method = "xgbTree",
      metric = "Accuracy",
      gafsControl = gafsControl(functions = caretGA,
                                method = "cv",
                                repeats = 2,
                                verbose = TRUE),
      trControl = trainControl(method = "cv",
                               classProbs = TRUE,
                               verboseIter = TRUE)))
Fold01 1 0.9596575 (1)
Fold01 2 0.9596575->0.9667641 (1->1, 100.0%) *
Fold02 1 0.9598146 (1)
Fold02 2 0.9598146->0.9641482 (1->1, 100.0%) *
Fold03 1 0.9502661 (1)
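I let the code above run overnight (8 to 10 hours) and then stopped it because it was taking too long. As a very rough estimate, the full run would need at least 24 hours.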
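A back-of-the-envelope count helps explain why the run is so slow and how much the smaller settings used in this answer (popSize 20 instead of 50, 5 folds instead of 10) should help. The cost model below is my own simplifying assumption, not something taken from the caret source: every individual in every generation triggers one train() call, once per external fold plus once more for the final search on the full training set. The helper names and grid_size are hypothetical.

```r
# Rough cost model (an assumption, see the text above):
# train() calls = (external folds + 1 final search) * generations * population
train_calls <- function(ext_folds, iters, pop_size) {
  (ext_folds + 1) * iters * pop_size
}

# Each train() call fits one model per internal fold and tuning candidate,
# plus one final model on all of the data it was given.
model_fits <- function(ext_folds, iters, pop_size, int_folds, grid_size) {
  train_calls(ext_folds, iters, pop_size) * (int_folds * grid_size + 1)
}

original <- train_calls(ext_folds = 10, iters = 2, pop_size = 50)  # 1100
reduced  <- train_calls(ext_folds = 5,  iters = 2, pop_size = 20)  # 240
original / reduced                                                 # about 4.6

# grid_size is a placeholder: the real xgbTree tuning grid is larger than 1,
# which scales both totals up by a similar factor.
grid_size <- 1
fits_original <- model_fits(10, 2, 50, int_folds = 10, grid_size = grid_size)  # 12100
fits_reduced  <- model_fits(5,  2, 20, int_folds = 5,  grid_size = grid_size)  # 1440
fits_original / fits_reduced                                                   # about 8.4
```

Under these assumptions the reduced settings cut the work by roughly a factor of 8 to 9 before any parallelization is applied.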
Second, with the popSize argument reduced (from 50 to 20), the allowParallel and genParallel options added to gafsControl(), and the number of folds reduced (from 10 to 5) in both gafsControl() and trainControl():
> library(doParallel)
> cl <- makePSOCKcluster(detectCores() - 1)
> registerDoParallel(cl)
> set.seed(1)
> st.09 <- system.time(results.09 <- gafs(iris[,1:4], iris[,5],
      iters = 2,
      popSize = 20,
      method = "xgbTree",
      metric = "Accuracy",
      gafsControl = gafsControl(functions = caretGA,
                                method = "cv",
                                number = 5,
                                verbose = TRUE,
                                allowParallel = TRUE,
                                genParallel = TRUE),
      trControl = trainControl(method = "cv",
                               number = 5,
                               classProbs = TRUE,
                               verboseIter = TRUE)))
final GA
1 0.9508099 (4)
2 0.9508099->0.9561501 (4->1, 25.0%) *
final model
> st.09
    user   system  elapsed
   3.536    0.173 4152.988
My system has 4 cores but, as specified, only 3 were used, and I verified that 3 R processes were running.
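The worker count can also be checked from within R, and the cluster should be shut down once the run is finished. The following is a minimal sketch, assuming the doParallel package is installed (it attaches foreach and parallel). The n_workers variable and the fallback to a single worker are my additions.

```r
library(doParallel)  # also attaches foreach and parallel

# Leave one core free; fall back to a single worker if detectCores()
# returns NA or 1.
n_workers <- max(1L, detectCores() - 1L, na.rm = TRUE)
cl <- makePSOCKcluster(n_workers)
registerDoParallel(cl)

# Confirm which backend foreach (and therefore caret) will use,
# and how many workers it has.
getDoParName()
getDoParWorkers()

# ... run gafs() here ...

# Shut the workers down and return to sequential execution.
stopCluster(cl)
registerDoSEQ()
```

Without the stopCluster() call the worker processes keep running until the R session ends.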
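The gafsControl() documentation describes what allowParallel and genParallel mean: allowParallel controls whether a registered parallel backend is used for the external resampling loop, and genParallel controls whether it is used for the fitness calculations within a generation.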
The caret documentation suggests that the allowParallel option gives a larger run-time improvement than the genParallel option:
https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
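One way to check that claim on your own machine is to time the same gafs() call with only one of the two options switched on. The sketch below builds just the two control objects and assumes caret is installed. The names ctrl_outer and ctrl_inner are illustrative, and I have not timed these configurations myself.

```r
library(caret)

# Parallelize only the external resampling loop.
ctrl_outer <- gafsControl(functions = caretGA,
                          method = "cv",
                          number = 5,
                          allowParallel = TRUE,
                          genParallel = FALSE)

# Parallelize only the fitness calculations within each generation.
ctrl_inner <- gafsControl(functions = caretGA,
                          method = "cv",
                          number = 5,
                          allowParallel = FALSE,
                          genParallel = TRUE)

c(outer = ctrl_outer$allowParallel, inner = ctrl_inner$genParallel)
```

Each object can be passed as gafsControl = ctrl_outer or gafsControl = ctrl_inner inside system.time(), with a parallel backend registered, to compare the elapsed times.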
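I would expect the parallelized code to give slightly different results from the original code. Here are the results from the parallelized code: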
> results.09
Genetic Algorithm Feature Selection
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
Maximum generations: 2
Population per generation: 20
Crossover probability: 0.8
Mutation probability: 0.1
Elitism: 0
Internal performance values: Accuracy, Kappa
Subset selection driven to maximize internal Accuracy
External performance values: Accuracy, Kappa
Best iteration chose by maximizing external Accuracy
External resampling method: Cross-Validated (5 fold)
During resampling:
* the top 4 selected variables (out of a possible 4):
Petal.Width (80%), Petal.Length (40%), Sepal.Length (20%), Sepal.Width (20%)
* on average, 1.6 variables were selected (min = 1, max = 4)
In the final search using the entire training set:
* 4 features selected at iteration 1 including:
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
* external performance at this iteration is
Accuracy Kappa
0.9467 0.9200
... what the recommendation for popSize is. The end of the "Details" section in ?gafs is about parallelization. - Julius Vainora