Feature selection with glmnet when there are more than two classes (family = "multinomial")

Question

It is well known that glmnet can be used as a feature-selection tool. Here is a toy example:

library(glmnet)

# Binomial dataset, the number of classes is 2
data(BinomialExample)
# data truncation to 10 columns, just to make the example dataset smaller
x <- BinomialExample$x[,1:10] 
y <- BinomialExample$y
cvfit = cv.glmnet(x, y, family = "binomial")
coefs <- coef(cvfit)

The coefs variable shows which features were selected (in this example, all features except V1 and V7). This result is clear and easy to interpret.

> coefs
11 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept)  0.1048257
V1           .        
V2           0.5901863
V3          -0.4060696
V4          -0.9627180
V5          -0.1067188
V6          -0.7813739
V7           .        
V8          -0.4106554
V9           0.5733065
V10         -1.0492793
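
Reading the selected features off the printed matrix by eye gets tedious as the number of columns grows. The following is a minimal sketch of extracting them programmatically. It assumes the list layout of the built-in data used above, and the names beta and selected are introduced here for illustration. Note that coef() on a cv.glmnet object uses s = "lambda.1se" unless another value is given, and that the folds are assigned at random, so the selected set can vary between runs unless a seed is fixed.

```r
library(glmnet)

# Same data as in the example above (list layout of the built-in data assumed).
data(BinomialExample)
x <- BinomialExample$x[, 1:10]
y <- BinomialExample$y

set.seed(1)  # cv.glmnet assigns observations to folds at random
cvfit <- cv.glmnet(x, y, family = "binomial")

# Spell out the default so the choice of lambda is explicit.
coefs <- coef(cvfit, s = "lambda.1se")

# Convert the 11 x 1 sparse matrix to a named numeric vector.
beta <- as.matrix(coefs)[, 1]

# Keep the names of nonzero coefficients and drop the intercept.
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
print(selected)
```

Passing s = "lambda.min" instead typically keeps more features, because it uses the lambda with the lowest cross-validated error rather than the more heavily penalized one-standard-error choice.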

The question is how to interpret the output when there are more than two classes. Here is a toy example:

# Multinomial, the number of classes is 3
data(MultinomialExample)
x <- MultinomialExample$x[,1:10] 
y <- MultinomialExample$y
cvfit = cv.glmnet(x, y, family = "multinomial")
coefs <- coef(cvfit)

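coefs now holds three sets of selected features, one per class.
Question: which set should be used as the best feature set?
In other words: can glmnet be used as a feature-selection tool when there are more than two classes?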

> coefs
$`1`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) -0.03279324
V1           .         
V2          -0.08585827
V3           0.40882396
V4          -0.08639670
V5          -0.15763031
V6           0.22513768
V7           .         
V8           0.17657623
V9           .         
V10          .         

$`2`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.01255996
V1          -0.21913800
V2           .         
V3           .         
V4           .         
V5           0.41329881
V6           .         
V7           .         
V8           .         
V9          -0.57131512
V10          0.52214739

$`3`
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.02023328
V1           0.09163282
V2           0.42655929
V3           .         
V4           0.29403632
V5           .         
V6          -0.12306560
V7           .         
V8          -0.44815059
V9           0.88580234
V10         -0.20920812

- artgram

2 Answers

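I don't know the exact parameterization glmnet uses, but a multinomial regression can be expressed as a set of binomial regressions, where the number of regressions depends on the number of classes.

cv.glmnet regularizes these individual regressions to perform what you call "feature selection" (Lasso regularization in your example), so you get 3 sparse coefficient matrices, one per class, in which some coefficients are set to zero.

Think of it this way: a feature may have predictive power for some classes but not for others, so the selected features can differ from one regression to the next. Regularization helps to sort this out. In that sense, the procedure does not look for a "globally" best set of regressors. Other types of regularization take a different approach; see this post on stats.stackexchange.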
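
To make the per-class view concrete, here is a minimal sketch that lists the features each class keeps and then combines them. It assumes the list layout of the built-in data used in the question. The helper nonzero_features and the names per_class, used_by_any and used_by_all are hypothetical and not part of glmnet.

```r
library(glmnet)

# Same data as in the question (list layout of the built-in data assumed).
data(MultinomialExample)
x <- MultinomialExample$x[, 1:10]
y <- MultinomialExample$y

set.seed(1)  # cv.glmnet assigns observations to folds at random
cvfit <- cv.glmnet(x, y, family = "multinomial")
coefs <- coef(cvfit)  # a list with one sparse column matrix per class

# Hypothetical helper: names of the nonzero, non-intercept coefficients.
nonzero_features <- function(m) {
  beta <- as.matrix(m)[, 1]
  setdiff(names(beta)[beta != 0], "(Intercept)")
}

per_class <- lapply(coefs, nonzero_features)
print(per_class)

# Features used by at least one class, and features used by every class.
used_by_any <- Reduce(union, per_class)
used_by_all <- Reduce(intersect, per_class)
print(used_by_any)
print(used_by_all)
```

Taking the union is one pragmatic way to collapse the per-class sets into a single list, since a feature that helps separate any one class still contributes to the model. It is a design choice made for this sketch, not something the fit itself prescribes.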

- Martin C. Arnold

Source: Stack Overflow.

- Zheyuan Li · Accepted Answer

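You need the grouped LASSO penalty, which you get by setting type.multinomial = "grouped". You will then see that the coefficients for all classes share the same zero/nonzero pattern.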
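
Why the pattern is shared can be read off the penalty term. The following is a sketch based on the penalty as described in the glmnet documentation, specialized to the default alpha = 1. The notation is introduced here for illustration and does not come from the original answer: \beta_{jk} is the coefficient of feature j for class k, with p features and K classes.

```latex
\text{ungrouped:}\quad
  \lambda \sum_{j=1}^{p} \sum_{k=1}^{K} \lvert \beta_{jk} \rvert
\qquad\qquad
\text{grouped:}\quad
  \lambda \sum_{j=1}^{p} \Bigl( \sum_{k=1}^{K} \beta_{jk}^{2} \Bigr)^{1/2}
```

The ungrouped penalty can shrink each \beta_{jk} to zero on its own, which is why the classes end up with different feature sets. The grouped penalty acts on the whole vector (\beta_{j1}, ..., \beta_{jK}), so a feature is either dropped from all classes or kept in all of them.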

library(glmnet)
data(MultinomialExample)  ## see "Note" below
#x <- MultinomialExample$x  ## see "Note" below
#y <- MultinomialExample$y  ## see "Note" below
cvfit <- cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
coef(cvfit)

Note:

The structure of the built-in data has changed. I am using glmnet 4.1.1, where it is enough to run

data(MultinomialExample)

but you are using glmnet 4.1.4, which requires

data(MultinomialExample)
x <- MultinomialExample$x
y <- MultinomialExample$y
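
To check the shared zero/nonzero pattern and pull out a single feature list, something along these lines could be used. It is a minimal sketch that assumes the newer data layout, with x and y taken from the MultinomialExample list. The names beta, nonzero, same_pattern and selected are introduced here for illustration.

```r
library(glmnet)

# Newer data layout assumed: x and y are elements of a list.
data(MultinomialExample)
x <- MultinomialExample$x
y <- MultinomialExample$y

set.seed(1)  # cv.glmnet assigns observations to folds at random
cvfit <- cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")

# Bind the per-class columns into one (features + intercept) x classes matrix.
beta <- do.call(cbind, lapply(coef(cvfit), as.matrix))
beta <- beta[rownames(beta) != "(Intercept)", , drop = FALSE]

# TRUE where a coefficient is nonzero; compare every class with the first one.
nonzero <- beta != 0
same_pattern <- all(nonzero == nonzero[, 1])
print(same_pattern)

# With a shared pattern, any one column gives the selected feature set.
selected <- rownames(beta)[nonzero[, 1]]
print(selected)
```

As with any cross-validated fit, the selected set depends on the random fold assignment and on the lambda used by coef(), which is "lambda.1se" unless another value is passed through s.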