Variable selection based on importance

Question

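I am having trouble filtering out the least important variables in my model. I received a data set with more than 4,000 variables and was asked to reduce the number of variables that go into the model. I have tried two approaches, and both failed. The first thing I tried was to check variable importance manually after modelling and drop the unimportant variables based on that.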

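# reproducible example
library(dplyr)
library(mlr3)
library(mlr3learners)   # provides classif.xgboost
library(mlr3pipelines)
library(mlr3tuning)

# artificial class imbalance: "virginica" becomes the positive class "1"
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))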

With a plain Learner everything works fine:

# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")

# creating Learner
lrn <- lrn("classif.xgboost") 

# setting scoring as prediction type 
lrn$predict_type = "prob"

lrn$train(task)
lrn$importance()

 Petal.Width Petal.Length 
  0.90606304   0.09393696

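The problem is that the data are highly imbalanced, so I decided to use a GraphLearner with a PipeOp to undersample the majority class, and then pass it to an AutoTuner: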

# undersampling
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)

# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)

# setting the autoTuner
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)

at$train(task)
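
The call above relies on resample, measure, ps_under, terminator and tuner, none of which are shown in the question. One possible set of definitions, purely as an assumption for illustration (every value below is hypothetical), which would have to run before the AutoTuner is constructed:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)      # ps(), p_dbl(), p_int()

# Hypothetical choices; the question does not show the real ones.
resample   <- rsmp("cv", folds = 3)           # 3-fold cross-validation
measure    <- msr("classif.auc")              # needs predict_type = "prob"
terminator <- trm("evals", n_evals = 5)       # stop after 5 evaluations
tuner      <- tnr("random_search")

# Inside a GraphLearner, parameter ids are prefixed with the PipeOp id,
# here "undersample" and "classif.xgboost". Ranges are illustrative only.
ps_under <- ps(
  undersample.ratio       = p_dbl(lower = 0.3, upper = 1),
  classif.xgboost.nrounds = p_int(lower = 5, upper = 50),
  classif.xgboost.eta     = p_dbl(lower = 0.01, upper = 0.3)
)
```

The prefixes matter: a search space that names nrounds without the classif.xgboost. prefix would not match any parameter of the GraphLearner.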

The problem now is that, although the importance property is still listed for at, $importance() is not available.

> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights

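So I decided to change my approach and try adding filtering to the Learner. That is where I failed even harder. I started from the mlr3book chapter at https://mlr3book.mlr-org.com/fs.html. I tried adding importance = "impurity" to the Learner as the book does, but it produced an error.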

> lrn <- lrn("classif.xgboost", importance = "impurity") 
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
  nie można zmienić wartości zablokowanego połączenia dla 'importance'

Which roughly translates to:

Error in 'instance[[nn]] <- dots[[i]]':  cannot change value of locked binding for 'importance'

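I also tried to work around this with the PipeOp filter, but that failed miserably as well. I believe I will not be able to do it without importance = "impurity".

So my question is: is there a way to achieve what I am aiming for?

Also, how is it possible to filter by importance before modelling? Shouldn't importance be based on the model's results?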

- Radbys

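Short answer: remove near-zero-variance features before modelling (e.g. library(caret); names(dataset)[nearZeroVar(dataset)]), then try removing highly correlated features (e.g. if you have "age" (e.g. 31) and "age group" (e.g. 25-29, 30-34), it is probably safe to drop "age group"). You can investigate correlation between features in many ways; one approach I use is to cluster the scores of all features across all samples with e.g. hclust and/or a large heatmap (I use Python for this, because R can struggle to plot >100 million data points). Then fit the model and check feature importance. - jared_mamrot

Hi, thanks for sharing. My question is more technical than theoretical. I know both of the techniques you pointed out. I would like to know why the $importance() function is not available after using a GraphLearner or an AutoTuner. - Radbys

1 Answer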

- mb706 · Accepted Answer

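The reason you cannot access $importance is that at is an AutoTuner, which does not provide variable importance directly and only "wraps" the actual Learner being tuned.

The trained GraphLearner is stored inside the AutoTuner under $learner.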

# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner

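This object does not have an $importance() method either. (In principle, a GraphLearner can contain more than one Learner, so it would not even know which importance to return!)

Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:

1. Get the untrained Learner object
2. Get the trained state of the Learner and put it into that object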

# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner

# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost

Now the importance can be queried.

xgboostlearner$importance()
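
Putting the steps above together, here is a self-contained sketch that skips the AutoTuner and trains the GraphLearner directly. The value nrounds = 10 is an arbitrary choice for illustration and does not come from the question:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Same artificial two-class task as in the question.
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data,
                        target = "Species", positive = "1")

# Undersampling PipeOp followed by xgboost, wrapped as a GraphLearner.
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)
xgb <- lrn("classif.xgboost", nrounds = 10, predict_type = "prob")
graphlearner <- GraphLearner$new(po_under %>>% xgb)
graphlearner$train(task)

# Step 1: get the Learner object stored in the graph.
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner

# Step 2: copy the trained state from the GraphLearner's model into it.
xgboostlearner$state <- graphlearner$model$classif.xgboost

# The importance scores are now available as a named numeric vector.
imp <- xgboostlearner$importance()
print(imp)
```

With the AutoTuner from the question, the same two steps apply, starting from graphlearner <- at$learner instead of training the GraphLearner by hand.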

The example from the book you linked does not work in your case because the book uses the ranger Learner, whereas you are using xgboost. importance = "impurity" is specific to ranger.
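
For the filtering route, xgboost therefore needs no importance argument at all. The following is a rough sketch rather than a tested recipe for the 4,000-variable data set: it assumes the mlr3filters package is installed, and nrounds = 10 and filter.nfeat = 2 are arbitrary illustrative values. It also shows how filtering "before modelling" can still be model-based, since the filter step fits its own xgboost model to score the features:

```r
library(mlr3)
library(mlr3learners)
library(mlr3filters)
library(mlr3pipelines)

# Same artificial two-class task as in the question.
data <- iris
data$Species <- factor(ifelse(data$Species == "virginica", "1", "0"))
task <- TaskClassif$new(id = "score", backend = data,
                        target = "Species", positive = "1")

# Importance filter backed by xgboost; no importance = "impurity" needed.
flt_imp <- flt("importance",
               learner = lrn("classif.xgboost", nrounds = 10))

# Keep the 2 highest-scoring features, then fit the final learner on them.
graph <- po("filter", filter = flt_imp, filter.nfeat = 2) %>>%
  lrn("classif.xgboost", nrounds = 10, predict_type = "prob")
glrn <- GraphLearner$new(graph)
glrn$train(task)

# The filter PipeOp takes its id from the filter ("importance");
# its state records which features were kept.
kept <- glrn$model$importance$features
print(kept)
```

Because filter.nfeat is an ordinary hyperparameter of the graph, it could in principle be tuned together with the learner instead of being fixed by hand.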