Different results from caret training with the formula and non-formula interfaces

Question

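I noticed that training in caret with the formula and non-formula methods produces different results. In addition, the formula method takes almost 10 times as long as the non-formula method. Is this expected?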

> library(caret)
> library(data.table)
> z <- data.table(c1 = sample(1:1000, 1000, replace = TRUE), c2 = as.factor(sample(LETTERS, 1000, replace = TRUE)))

# SYSTEM TIME WITH FORMULA METHOD
# -------------------------------

> system.time(r <- train(c1 ~ ., data = z, method = "rf", importance = TRUE))
   user  system elapsed
376.233   9.241  18.190

> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results across tuning parameters:

  mtry  RMSE  Rsquared  RMSE SD  Rsquared SD
  2     295   0.00114   4.94     0.00154
  13    300   0.00113   5.15     0.00151
  25    300   0.00111   5.16     0.00146

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 2.


# SYSTEM TIME WITH NON-FORMULA METHOD
# -------------------------------

> system.time(r <- train(z[, 2, with = FALSE], z$c1, method = "rf", importance = TRUE))
   user  system elapsed
 34.984   2.977   2.708
Warning message:
In randomForest.default(trainX, trainY, mtry = tuneValue$.mtry,  :
  invalid mtry: reset to within valid range
> r
1000 samples
   1 predictors

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  297   0.00152   6.67     0.00197

Tuning parameter 'mtry' was held constant at a value of 2

- xbsd

1 Answer

- topepo · Accepted Answer

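You have a categorical predictor with a moderate number of levels. When you use the formula interface, most modeling functions (including train, lm, glm, etc.) internally run model.matrix to process the data set. This creates dummy variables from any factor variables. The non-formula interface does not [1].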
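To make the dummy-variable expansion concrete, here is a minimal base R sketch. It assumes a plain data.frame shaped like the question's data and calls model.matrix directly; the object names x_formula and x_plain are made up for illustration, and the exact steps train takes internally may differ. The 25 columns it produces line up with the mtry values of 2, 13 and 25 tuned in the formula run, whereas the non-formula run saw a single predictor and held mtry constant.

```r
# Toy data mirroring the question (plain data.frame, base R only).
set.seed(1)
z <- data.frame(
  c1 = sample(1:1000, 1000, replace = TRUE),
  c2 = factor(sample(LETTERS, 1000, replace = TRUE), levels = LETTERS)
)

# Formula-style expansion: one 0/1 column per non-reference factor level.
# The first column of the model matrix is the intercept, so it is dropped.
x_formula <- model.matrix(c1 ~ ., data = z)[, -1, drop = FALSE]

# Non-formula style: the factor stays a single column.
x_plain <- z[, "c2", drop = FALSE]

ncol(x_formula)  # 25 dummy columns (26 levels minus the reference level)
ncol(x_plain)    # 1 factor column
```

If the goal is for both interfaces to see the same predictors, one option would be to pass a matrix like x_formula as the x argument of the non-formula call; whether that reproduces the formula results exactly is not something this sketch verifies.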

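When you use dummy variables, only one factor level is used in any split. Tree methods handle categorical predictors differently, but when dummy variables are not used, random forest sorts the levels of a factor predictor based on their outcome and finds a two-way split of the factor levels [2]. This takes more time.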
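The ordered two-way split can be sketched in a few lines of base R. This is an illustration of the idea for a numeric outcome, not randomForest's actual code: best_factor_split is a hypothetical helper, and the toy data is invented so that two levels clearly stand apart.

```r
# Hypothetical helper (not part of randomForest or caret): best two-way split
# of a factor for a numeric outcome, found by ordering levels by mean outcome.
best_factor_split <- function(f, y) {
  f <- droplevels(f)
  lev_means <- sort(tapply(y, f, mean))  # levels ordered by mean outcome
  ordered_levels <- names(lev_means)
  best <- list(sse = Inf, left = character(0))
  # Only the k - 1 cut points along the ordering are checked.
  for (i in seq_len(length(ordered_levels) - 1)) {
    left <- ordered_levels[seq_len(i)]
    in_left <- f %in% left
    sse <- sum((y[in_left] - mean(y[in_left]))^2) +
      sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) best <- list(sse = sse, left = left)
  }
  best
}

# Toy data: levels B and E have a much higher mean outcome than the rest.
set.seed(1)
f <- factor(sample(LETTERS[1:6], 200, replace = TRUE))
y <- ifelse(f %in% c("B", "E"), 10, 0) + rnorm(200)

split <- best_factor_split(f, y)
sort(split$left)  # levels sent to the left child
```

A dummy-variable split, by contrast, can only separate one level from all the others at a time, so grouping several levels on one side takes several splits.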

Max

[1] I hate to be one of those people who says "in my book I show..." but in this case I will. Figure 14.2 has a good illustration of this process for CART trees.

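[2] God, I'm doing it again. The different representations of factors for trees are discussed in Section 14.1, and a comparison of the two approaches on one data set is shown in Section 14.7.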