cv.glmnet fails for ridge but not for lasso, on simulated data

Question

r glmnet

Gist

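Error message: Error in predmat[which, seq(nlami)] = preds : replacement has length zero

Context: the data are simulated with a binary y, but there are n coders of the true y. The data are stacked n times and a model is fitted to try to recover the true y.

The error occurs in the following cases:

With the L2 penalty, but not with the L1 penalty.
When y is the coders' y, but not when it is the true y.
The error is not deterministic; it depends on the seed.

Update: the error appears in versions after 1.9-8. Version 1.9-8 does not fail.

Reproduction

Base data: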

library(glmnet)
rm(list=ls())
set.seed(123)

num_obs=4000
n_coders=2
precision=.8

X <- matrix(rnorm(num_obs*20, sd=1), nrow=num_obs)
prob1 <- plogis(X %*% c(2, -2, 1, -1, rep(0, 16))) # yes many zeros, ignore
y_true <- rbinom(num_obs, 1, prob1)
dat <- data.frame(y_true = y_true, X = X)

Create the coders

classify <- function(true_y,precision){
  n=length(true_y)
  y_coder <- numeric(n)
  y_coder[which(true_y==1)] <- rbinom(n=length(which(true_y==1)),
                                      size=1,prob=precision)
  y_coder[which(true_y==0)] <- rbinom(n=length(which(true_y==0)),
                                      size=1,prob=(1-precision))
  return(y_coder)
}
y_codings <- sapply(rep(precision,n_coders),classify,true_y = dat$y_true)

Stack everything

expanded_data <- do.call(rbind,rep(list(dat),n_coders))
expanded_data$y_codings <- matrix(y_codings, ncol = 1)

Reproduce the error

Because the error depends on the seed, a loop is needed. Only the first loop fails; the other two run to completion.

X <- as.matrix(expanded_data[,grep("X",names(expanded_data))])

for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings,
                            family="binomial", alpha=0)  # will fail
for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_codings,
                            family="binomial", alpha=1)  # will not fail
for (i in 1:1000) cv.glmnet(x = X,y = expanded_data$y_true,
                            family="binomial", alpha=0)  # will not fail

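Any idea where this comes from in glmnet, and how to avoid it? From my reading of cv.glmnet, the failure happens after the cv routine, inside cvstuff = do.call(fun, list(outlist, lambda, x, y, weights, offset, foldid, type.measure, grouped, keep)). I don't understand what that call does, so I can't tell why it fails or how to avoid it.

Sessions (Ubuntu and PC)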

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnet_2.0-2    foreach_1.4.3   Matrix_1.2-7.1  devtools_1.12.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1       R6_2.2.0         tools_3.3.1      withr_1.0.2      curl_2.1        
 [6] memoise_1.0.0    codetools_0.2-15 grid_3.3.1       iterators_1.0.8  knitr_1.14      
[11] digest_0.6.10    lattice_0.20-34

and

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnet_2.0-2    foreach_1.4.3   Matrix_1.2-7.1  devtools_1.12.0

loaded via a namespace (and not attached):
 [1] httr_1.2.1       R6_2.2.0         tools_3.3.1      withr_1.0.2      curl_2.1        
 [6] memoise_1.0.0    codetools_0.2-15 grid_3.3.1       iterators_1.0.8  digest_0.6.10   
[11] lattice_0.20-34

- Elad663

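This seems rather convoluted. Why have y_codings when you already have y_true? What is the difference between them? - Hong Ooi

You do not observe y_true. Instead, several human coders code y based on x, each with some precision. @HongOoi - Elad663

Changing the random seed solves it: https://github.com/lmweber/glmnet-error-example/blob/master/glmnet_error_example.R - Gary Weissman

I hit the same error with glmnet_2.0-5 in a similar situation, using ridge logistic regression. As mentioned in the comments (https://github.com/lmweber/glmnet-error-example/blob/master/glmnet_error_example.R), stepping through the code shows that the problem is `mlami` being greater than all `lambda` values. Has this bug been reported to the `glmnet` developers? - rwolst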
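The failure mode described in the comment above can be imitated in base R, without glmnet. This is only a sketch under assumptions: the names `lambda`, `mlami`, `which_lam`, `predmat`, and `preds` echo the error message and the comment, but the values are made up and the code is not copied from glmnet. It assumes the cross-validation step keeps only the lambdas that are at least `mlami`, so that when `mlami` exceeds every lambda there are no predictions left to store.

```r
# Illustrative values only; not taken from an actual glmnet run.
lambda <- c(5, 1, 0.2, 0.04)   # hypothetical lambda path of the full fit
mlami  <- 6                    # hypothetical threshold larger than every lambda

which_lam <- lambda >= mlami   # all FALSE: no lambda survives the filter
nlami     <- sum(which_lam)    # 0

n_fold_rows <- 3
predmat <- matrix(NA_real_, nrow = 10, ncol = length(lambda))
which   <- seq_len(10) <= n_fold_rows   # rows belonging to one fold
preds   <- matrix(numeric(0), nrow = n_fold_rows, ncol = nlami)  # zero columns

# seq(0) is c(1, 0), so one column is selected but there is nothing to put in it.
msg <- tryCatch({
  predmat[which, seq(nlami)] <- preds
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

If this assumption matches what glmnet does, it would also explain why supplying an explicit `lambda` sequence avoids the problem: every fold is then fitted on the same lambdas, so the filter cannot come up empty.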

2 Answers

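OK, I just ran the first loop and it completed successfully. This is with glmnet version 2.0-2.

This is more of a comment, but it is too long for one: when you run tests like this that depend on random numbers, you can save the seed as you go. That lets you jump into the middle of a test run instead of going back to the start every time.

Something like this: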

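results <- lapply(1:1000, function(i) {
    seed <- .Random.seed  # RNG state before this run; call set.seed() first so it exists
    # try() keeps the loop running even if a fit throws an error;
    # the call is the failing one from the question
    res <- try(cv.glmnet(x = X, y = expanded_data$y_codings,
                         family = "binomial", alpha = 0))
    attr(res, "seed") <- seed
    res
})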

Now you can look at the class of each result to check whether any runs failed:

errs <- sapply(results, function(x) inherits(x, "try-error"))
any(errs)

You can then retry a failed run:

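firstErr <- which(errs)[1]
# restore the RNG state saved just before the first failed run
# (at top level this is the same as assigning .Random.seed directly)
assign(".Random.seed", attr(results[[firstErr]], "seed"), envir = globalenv())
cv.glmnet(x = X, y = expanded_data$y_codings,
          family = "binomial", alpha = 0)  # rerun the failed fit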
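To try the save-and-restore idea without waiting for a real cv.glmnet failure, here is a minimal base R sketch. `flaky_fit` is a hypothetical stand-in that fails for some random draws; it is not part of glmnet, and the 0.1 failure threshold is arbitrary.

```r
set.seed(123)  # also guarantees that .Random.seed exists

# Hypothetical stand-in for a fit that fails for some random draws.
flaky_fit <- function() {
  u <- runif(1)
  if (u < 0.1) stop("simulated failure, u = ", u)
  u
}

results <- lapply(1:50, function(i) {
  seed <- get(".Random.seed", envir = globalenv())  # RNG state before this run
  res <- try(flaky_fit(), silent = TRUE)
  attr(res, "seed") <- seed
  res
})

errs <- sapply(results, inherits, what = "try-error")

if (any(errs)) {
  first_err <- which(errs)[1]
  # Restore the saved state in the global environment, then rerun.
  assign(".Random.seed", attr(results[[first_err]], "seed"), envir = globalenv())
  rerun <- try(flaky_fit(), silent = TRUE)
  # The rerun sees the same random draw, so it fails with the same message.
  print(identical(as.character(rerun), as.character(results[[first_err]])))
}
```

Using `assign(..., envir = globalenv())` rather than a plain `<-` matters as soon as the restore happens inside a function, where `<-` would only create a local variable and leave the generator untouched.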

Session info:

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.850    
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnetUtils_0.55    RevoUtilsMath_8.0.3 RevoUtils_8.0.3     RevoMods_8.0.3      RevoScaleR_8.0.6   
[6] lattice_0.20-33     rpart_4.1-10       

loaded via a namespace (and not attached):
[1] Matrix_1.2-2     parallel_3.2.2   codetools_0.2-14 rtvs_1.0.0.0     grid_3.2.2      
[6] iterators_1.0.8  foreach_1.4.3    glmnet_2.0-2

(That should read Windows 10, not 8; R 3.2.2 does not recognize Win10.)

- Hong Ooi

- user2173836 · Accepted Answer

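I ran into the same error with glmnet_2.0-5. It has to do with how lambda is created automatically in some cases. The workaround is to supply your own lambda.

For example: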

cv.glmnet(x = X,
          y = expanded_data$y_codings,
          family="binomial", 
          alpha=0,
          lambda=exp(seq(log(0.001), log(5), length.out=100)))

Credit to https://github.com/lmweber/glmnet-error-example/blob/master/glmnet_error_example.R.
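If the same grid is needed in several places, it can be wrapped in a small helper. The sketch below is an assumption-laden convenience, not part of glmnet: `make_lambda_grid` is a hypothetical name, and the bounds 0.001 and 5 are simply the ones used in this answer, not values tuned to the data. It returns the values in decreasing order, which is the order the glmnet documentation recommends for a user-supplied sequence.

```r
# Hypothetical helper (not part of glmnet): n log-spaced values from hi down to lo.
make_lambda_grid <- function(lo = 0.001, hi = 5, n = 100) {
  stopifnot(lo > 0, hi > lo, n >= 2)
  exp(seq(log(hi), log(lo), length.out = n))
}

lambda_grid <- make_lambda_grid()
range(lambda_grid)          # endpoints of the grid
all(diff(lambda_grid) < 0)  # strictly decreasing

# Passing the grid to the call from the question (requires glmnet and the data above):
# cv.glmnet(x = X, y = expanded_data$y_codings,
#           family = "binomial", alpha = 0, lambda = lambda_grid)
```

A fixed grid only helps if it covers the range of penalties that matters for the data, so it is worth checking that the selected lambda does not sit at either end of the grid.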