How to automate variable selection in glmnet and cross validation

Question

I am learning how to use the glmnet and brnn packages. Consider the following code:

library(RODBC)
library(brnn)
library(glmnet)
memory.limit(size = 4000)
z <- odbcConnect("mydb") # database with Access queries and tables

# import the data
f5 <- sqlFetch(z,"my_qry")

# head(f5)

# check for 'NA'
sum(is.na(f5))

# choose a 'locn', up to 16 of variable 'locn' are present
f6 <- subset(f5, locn == "mm")
# dim(f6)

# use glmnet to identify possible iv's

training_xnm <- f6[,1:52] # training data
xnm <- as.matrix(training_xnm)
y <- f6[,54] # response

fit.nm <- glmnet(xnm, y, family = "binomial", alpha = 0.6, nlambda = 1000, standardize = TRUE, maxit = 100000)
# print(fit.nm)

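# cross validation for glmnet to determine a good lambda value
# note: family and alpha are not passed here, so cv.glmnet falls back to its
# defaults (family = "gaussian", alpha = 1) rather than the binomial
# elastic net fitted above
cv.fit.nm <- cv.glmnet(xnm, y)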

# have a look at the 'min' and '1se' lambda values
cv.fit.nm$lambda.min
cv.fit.nm$lambda.1se
# returned $lambda.min of 0.002906279, $lambda.1se of 2.587214

# for testing purposes I choose a value between 'min' and '1se'
mid.lambda.nm = (cv.fit.nm$lambda.min + cv.fit.nm$lambda.1se)/2

print(coef(fit.nm, s = mid.lambda.nm)) # 8 iv's retained

# I then manually inspect the data frame and enter the column index for each of the iv's
# these iv's will be the input to my 'brnn' neural nets

cols <- c(1, 3, 6, 8, 11, 20, 25, 38) # column indices of useful iv's

# brnn creation: only one shown but this step will be repeated
# take a 85% sample from data frame
ridxs <- sample(1:nrow(f6), floor(0.85*nrow(f6)) ) # row id's
f6train <- f6[ridxs,] # the resultant data frame of 85%
f6train <-f6train[,cols] # 'cols' as chosen above

# For the 'brnn' phase response is a binary value, 'fin'
# and predictors are the 8 iv's found earlier
out <- brnn(fin ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, data = f6train, neurons = 3, normalize = TRUE, epochs = 500, verbose = FALSE)
#summary(out)

# see how well the net predicts the training cases
pred <- predict(out)

The script above runs fine.

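My question is: how can I automate the script above so that it runs for different values of locn? In other words, how can I generalize the step that produces cols <- c(1, 3, 6, 8, 11, 20, 25, 38) # column indices of useful iv's. At present I can do this by hand, but I cannot see how to do it in a general way for different values of locn, for example: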

locn.list <- c("am", "bm", "cm", "dm", "em")  
for(j in 1:5) {
this.locn <- locn.list[j]
# run the above script
}

- cousin_pete

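It does not look possible to test anything with your data, but you should know straight away that using "(" after a token makes R look for a function of that name. You probably want locn.list[j]. The j <- 1 line seems entirely superfluous. - IRTFM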

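Thanks for the comment, DWin: my typo, and I agree that j <- 1 is superfluous! As I mentioned, the code runs without problems; my question is how to generalize the set of useful variables obtained from glmnet after cross validation. At present I run the code many times a day on live financial data for a single 'locn' value. I could write separate scripts for all 17 'locn' values and run them in turn, but I would like to capture the line that begins cols <- c(1,...... programmatically, rather than typing it in by hand for each 'locn'. - cousin_pete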

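When you agree that there is an error in your code, you should edit your question. I would be interested in this problem if you could provide a clean dataset. - IRTFM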

Thanks DWin, I have edited my post as you suggested. - cousin_pete

1 Answer

- cousin_pete · Accepted Answer

Since posting my question I have found a paper by Simon, Friedman, Hastie and Tibshirani, "Coxnet: Regularized Cox Regression", which addresses how to extract what I want.

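Here are the relevant details from the paper, adapted to my data (apart from the lambda symbol!). We can check which covariates our model chose to be active and see the coefficients of those covariates.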

coef(fit.nm, s = cv.fit.nm$lambda.min) # returns the p length coefficient vector

This is the solution corresponding to lambda = cv.fit.nm$lambda.min.

Coefficients <- coef(fit.nm, s = cv.fit.nm$lambda.min)
Active.Index <- which(Coefficients != 0)
Active.Coefficients <- Coefficients[Active.Index]

Active.Index # identifies the covariates that are active in the model and
Active.Coefficients # shows the coefficients of those covariates
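
One caveat when turning Active.Index into column numbers: for a binomial glmnet fit, coef() returns p + 1 rows with "(Intercept)" in row 1, so the positions are shifted by one relative to the columns of the predictor matrix. The sketch below shows one way to handle that on simulated data. The names x, y, cols and active.names, the simulated signal, and the choice of lambda.min are illustrative assumptions and not part of the original script.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 10
# simulated predictors standing in for xnm (assumption, not the real data)
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
# only x1 and x3 carry signal in this simulated example
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 3]))

cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.6)

# coef() gives a (p + 1) x 1 sparse matrix; row 1 is "(Intercept)"
Coefficients <- as.matrix(coef(cv.fit, s = "lambda.min"))[, 1]
Active.Index <- which(Coefficients != 0)

# drop the intercept row, then shift by one to get column positions in x
cols <- setdiff(Active.Index, 1) - 1
active.names <- colnames(x)[cols]
```

Here cols plays the role of the hand-typed cols <- c(...) line, and active.names gives the same selection by name, which is less fragile if the column order ever changes.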

Hope this helps others!
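
To run the selection for every locn value, the steps can be wrapped in a function and applied over locn.list. This is a minimal sketch under stated assumptions: select_cols is a hypothetical helper, the data frame is simulated in place of the real f5, and lambda.min is used as the cut-off (lambda.1se or a midpoint could be passed to s instead).

```r
library(glmnet)

# Hypothetical helper: fit one locn value and return the data frame column
# indices of the predictors with non-zero coefficients at lambda.min.
select_cols <- function(data, locn_value, x_cols, y_col) {
  sub <- data[data$locn == locn_value, ]
  x <- as.matrix(sub[, x_cols])
  y <- sub[[y_col]]
  cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.6)
  b <- as.matrix(coef(cv.fit, s = "lambda.min"))[, 1]
  # row 1 is the intercept: drop it and shift to positions within x_cols
  x_cols[setdiff(which(b != 0), 1) - 1]
}

# Simulated stand-in for f5 (assumption: 8 predictors, binary response 'fin')
set.seed(42)
n <- 600
f5 <- data.frame(matrix(rnorm(n * 8), n, 8))
names(f5) <- paste0("x", 1:8)
f5$locn <- rep(c("am", "bm", "cm"), each = n / 3)
f5$fin <- rbinom(n, 1, plogis(2 * f5$x2 - 2 * f5$x5))

locn.list <- c("am", "bm", "cm")
cols.by.locn <- lapply(locn.list, function(l) select_cols(f5, l, 1:8, "fin"))
names(cols.by.locn) <- locn.list
```

Each element of cols.by.locn can then replace the manual cols vector for that locn. When subsetting for brnn, the response column has to be kept alongside the selected predictors, for example f6[, c(cols.by.locn[["am"]], which(names(f6) == "fin"))].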