确保 `mgcv::gam` 函数具有超时机制。

Question

确保 `mgcv::gam` 函数具有超时机制。

3

我正在使用mgcv拟合一些GAM，其中在某些情况下需要高k值来捕获复杂的行为。但是我注意到有时（某些数据集），当k很高时，拟合GAM需要很长时间，并且在这些情况下似乎也无法收敛。我需要尝试拟合大量数据集，不能在遇到其中一个可能需要很长时间并最终失败的数据集时等待三天！

我遇到了 R.utils::withTimeout()，它似乎是一个很有前途的工具，可以确保如果我遇到这些游戏时间陷阱之一，我就可以继续前进，但它对我来说表现不一致。下面是连续三次运行相同脚本的输出结果。请注意，在这三次中的第一次中，超时显然没有发生。我模糊地理解有些情况下，withTimeout 有望失败... 我想知道如何强制停止 mgcv::gam，可能需要使用不同的工具（我看到了对包 processx 的参考）。

这看起来很多，但实际上是完全重复的；下面还有会话信息。从下面的代码中可以看出，关键点是 withTimeout 显然第一次什么也没做（花了30秒），然后每次都会启动（8秒似乎足够接近6秒）。

谢谢！

> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.418 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
> tictoc::toc()
30.332 sec elapsed
> myk <- 120
> 
> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.415 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I 
> # guarantee 6 seconds.
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.112 sec elapsed
> # Test timeout with mgcv::gam
> myk <- 120
> 
> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.318 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I 
> # guarantee 6 seconds.
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.04 sec elapsed

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10          compiler_4.2.3       pillar_1.8.1         R.methodsS3_1.8.2    R.utils_2.12.2       tools_4.2.3          mvnfast_0.2.8        digest_0.6.31        evaluate_0.20        lubridate_1.9.2     
[11] lifecycle_1.0.3      tibble_3.2.1         nlme_3.1-162         gtable_0.3.2         lattice_0.20-45      timechange_0.2.0     mgcv_1.8-42          pkgconfig_2.0.3      rlang_1.1.0          Matrix_1.5-3        
[21] cli_3.6.0            rstudioapi_0.14      patchwork_1.1.2.9000 yaml_2.3.7           parallel_4.2.3       xfun_0.37            fastmap_1.1.1        gratia_0.8.1         knitr_1.42           stringr_1.5.0       
[31] furrr_0.3.1          dplyr_1.1.0          generics_0.1.3       vctrs_0.6.0          globals_0.16.2       tictoc_1.1           grid_4.2.3           tidyselect_1.2.0     glue_1.6.2           listenv_0.9.0       
[41] R6_2.5.1             fansi_1.0.4          parallelly_1.34.0    rmarkdown_2.20       tidyr_1.3.0          purrr_1.0.1          ggplot2_3.4.1        magrittr_2.0.3       htmltools_0.5.4      scales_1.2.1        
[51] codetools_0.2-19     splines_4.2.3        colorspace_2.1-0     future_1.32.0        utf8_1.2.3           stringi_1.7.12       munsell_0.5.0        R.oo_1.25.0

- Michael Roswell

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Mikael Jagan · Accepted Answer

如果 R 进程在不检查中断的 C 代码中挂起，则使用 setTimeLimit（或基于该函数的实用程序）可能会失败。 setSessionTimeLimit 也依赖于中断检查，因此通过在 R 进程本身或 R 子进程上设置超时，同样可能会失败。

在这种情况下，您可以使用诸如 Linux 上的 timeout 命令，在 R（子）进程上设置系统级超时。例如:

/* startInfiniteLoop.c */
#include <Rinternals.h>
SEXP startInfiniteLoop(void) {
    while (1) ;
    return R_NilValue;
}

## startInfiniteLoop.R
tools::Rcmd(c("SHLIB", "startInfiniteLoop.c"))
dyn.load("startInfiniteLoop.so")
.Call("startInfiniteLoop")

这个 R 进程挂起了，你需要手动终止它：

$ R -f startInfiniteLoop.R

这个也会卡住，但会在20秒后自动被杀掉：

$ timeout 20 R -f startInfiniteLoop.R

您可以使用system2从R进程调用timeout，因此如果您喜欢的话，可以完全在您的R脚本中指定R子进程的超时时间：

system2("timeout", c("20", "R", "-f", "startInfiniteLoop.R"))

显然，你的用例更加复杂。你可能需要在子进程中传递变量等等，但或许你可以自己填补这些细节。