How to save a tidymodels LightGBM model for reuse


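I have the following code, which builds a tidymodels workflow around a lightgbm model. However, I run into problems when I save the fitted workflow as an .rds object and then try to predict with it.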

library(AmesHousing)
library(treesnip)
library(lightgbm)
library(tidymodels)
tidymodels_prefer()

### Model ###

# data
data <- make_ames() %>%
  janitor::clean_names()

data <- subset(data, select = c(sale_price, bedroom_abv_gr, bsmt_full_bath, bsmt_half_bath, enclosed_porch, fireplaces,
                                full_bath, half_bath, kitchen_abv_gr, garage_area, garage_cars, gr_liv_area, lot_area,
                                lot_frontage, year_built, year_remod_add, year_sold))

data$id <- c(1:nrow(data))

data <- data %>%
  mutate(id = as.character(id)) %>%
  select(id, everything())

# model specification

lgbm_model <- boost_tree(
  mtry = 7,
  trees = 347,
  min_n = 10,
  tree_depth = 12,
  learn_rate = 0.0106430579211173,
  loss_reduction = 0.000337948798058139,
) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", objective = "regression")

# recipe and workflow

lgbm_recipe <- recipe(sale_price ~., data = data) %>%
  update_role(id, new_role = "ID") %>%
  step_corr(all_predictors(), threshold = 0.7) %>%
  prep()

lgbm_workflow <- workflow() %>% 
  add_recipe(lgbm_recipe) %>%
  add_model(lgbm_model)  
  
# fit workflow

fit_lgbm_workflow <- lgbm_workflow %>%
  fit(data = data)

# predict

data_predict <- subset(data, select = -c(sale_price))
predict(fit_lgbm_workflow, new_data = data_predict)


### CASE 1: Save the workflow with saveRDS()

saveRDS(object = fit_lgbm_workflow, file = "lgbm_workflow.rds")
new_lgbm_workflow <- readRDS(file = "lgbm_workflow.rds")

# Predict - error: Attempting to use a Booster which no longer exists

predict(new_lgbm_workflow, new_data = data_predict)



### CASE 2: Save the workflow and the fitted model separately

fitted_model <- (fit_lgbm_workflow %>% extract_fit_parsnip())$fit
saveRDS(object = fit_lgbm_workflow, file = "lgbm_workflow.rds")
lightgbm::saveRDS.lgb.Booster(object = fitted_model, file = "lgbm_model.rds")


new_lgbm_workflow <- readRDS(file = "lgbm_workflow.rds")
new_lgbm_model <- lightgbm::readRDS.lgb.Booster(file = "lgbm_model.rds")
new_lgbm_workflow$fit$fit <- new_lgbm_model


# Predict - error: cannot predict on data of class ‘tbl_df’‘tbl’‘data.frame’

predict(new_lgbm_workflow, new_data = data_predict)

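Only workflows that use a lightgbm model seem to have this problem. For other model types (random forest, xgboost, glm, etc.), I can save the fitted workflow with saveRDS(), read it back with readRDS(), and predict on new data without any trouble.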

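In the second case, the underlying prediction function apparently becomes predict.lgb.Booster(), which takes a matrix as input. But my id variable is of type character, and all columns in a matrix must have the same type.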
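
The type constraint mentioned above can be seen in base R alone. A minimal sketch with a made-up two-row data frame (the values are illustrative, not taken from the Ames data): converting a data frame that contains a character column yields a character matrix, while dropping the id column first keeps the matrix numeric.

```r
# Toy data frame: a character id next to two numeric predictors (illustrative values)
df <- data.frame(
  id = c("1", "2"),
  garage_area = c(528, 730),
  lot_area = c(31770, 11622),
  stringsAsFactors = FALSE
)

# Converting everything at once coerces every column to character
m_all <- as.matrix(df)
typeof(m_all)
#> [1] "character"

# Dropping the id column first keeps the matrix numeric
m_num <- as.matrix(df[, c("garage_area", "lot_area")])
typeof(m_num)
#> [1] "double"
```

This is why handing the whole tibble, id included, to a matrix-based predict method cannot work as-is.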

Is there a way to save the whole workflow for later use?


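Anecdotally, I have never had any problems saving workflow objects with readr::write_rds(); maybe give that function a try. - Mark Rieke
Unfortunately, I haven't had much luck with the models in the treesnip package. - Julia Silge
@griffinwings Did you ever solve this? I am running into exactly the same problem. It's a shame, because this type of model is faster and more accurate than XGBoost. - nate-m
@JuliaSilge, have you considered writing an article on LightGBM best practices with tidymodels/bonsai? - nate-m
@MarkRieke I was hoping that moving from treesnip to the bonsai package would fix this and let us use write_rds natively, but it did not. I can write the object out without trouble, but the problem appears when I try to read it back in. - nate-m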
2 Answers


After some digging, I found a solution in this closed issue.

library(tidymodels)
#> Warning: package 'tidymodels' was built under R version 4.2.1
#> Warning: package 'broom' was built under R version 4.2.1
#> Warning: package 'scales' was built under R version 4.2.1
#> Warning: package 'infer' was built under R version 4.2.1
#> Warning: package 'modeldata' was built under R version 4.2.1
#> Warning: package 'parsnip' was built under R version 4.2.1
#> Warning: package 'rsample' was built under R version 4.2.1
#> Warning: package 'tibble' was built under R version 4.2.1
#> Warning: package 'workflows' was built under R version 4.2.1
#> Warning: package 'workflowsets' was built under R version 4.2.1
library(bonsai)
library(lightgbm)
#> Warning: package 'lightgbm' was built under R version 4.2.1
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice

# data

data <- modeldata::ames %>%
  janitor::clean_names()

data <- subset(data, select = c(sale_price, bedroom_abv_gr, bsmt_full_bath, bsmt_half_bath, enclosed_porch, fireplaces,
                                full_bath, half_bath, kitchen_abv_gr, garage_area, garage_cars, gr_liv_area, lot_area,
                                lot_frontage, year_built, year_remod_add, year_sold))

data$id <- c(1:nrow(data))

data <- data %>%
  mutate(id = as.character(id)) %>%
  select(id, everything())

# model specification

lgbm_model <- boost_tree(
  mtry = 7,
  trees = 347,
  min_n = 10,
  tree_depth = 12,
  learn_rate = 0.0106430579211173,
  loss_reduction = 0.000337948798058139,
) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", objective = "regression")

# recipe and workflow

lgbm_recipe <- recipe(sale_price ~., data = data) %>%
  update_role(id, new_role = "ID") %>%
  step_corr(all_predictors(), threshold = 0.7)

lgbm_workflow <- workflow(preprocessor = lgbm_recipe,
                          spec = lgbm_model)

# fit workflow

fit_lgbm_workflow <- lgbm_workflow %>%
  fit(data = data)

# predict

data_predict <- subset(data, select = -c(sale_price))
predict(fit_lgbm_workflow, new_data = data_predict)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 201911.
#>  2 124695.
#>  3 138983.
#>  4 221095.
#>  5 198972.
#>  6 188613.
#>  7 198730.
#>  8 170893.
#>  9 243899.
#> 10 196875.
#> # … with 2,920 more rows

# save the trained workflow and lgb.booster object separately

saveRDS(fit_lgbm_workflow, "lgbm_wflw.rds")
saveRDS.lgb.Booster(extract_fit_engine(fit_lgbm_workflow), "lgbm_booster.rds")

# load trained workflow and merge it with lgb.booster

new_lgbm_wflow <- readRDS("lgbm_wflw.rds")
new_lgbm_wflow$fit$fit$fit <- readRDS.lgb.Booster("lgbm_booster.rds")

predict(new_lgbm_wflow, data_predict)
#> # A tibble: 2,930 × 1
#>      .pred
#>      <dbl>
#>  1 201911.
#>  2 124695.
#>  3 138983.
#>  4 221095.
#>  5 198972.
#>  6 188613.
#>  7 198730.
#>  8 170893.
#>  9 243899.
#> 10 196875.
#> # … with 2,920 more rows

Created on 2022-09-07 with reprex v2.0.2

In the example above I fit a workflow. If you are fitting a parsnip object instead, use the following:


saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read


See this comment for more details.
The silver lining:
As of December 2021, the development version of {lightgbm} supports reading and saving {lightgbm} models directly with readRDS() / saveRDS().

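Awesome! I think this is basically it. A few small adjustments: saveRDS.lgb.Booster() has been deprecated, so we need to use lightgbm::lgb.save and lightgbm::lgb.load instead. For my workflow (select_best > finalize_workflow > last_fit after tuning), the booster lives here: new_lgbm_wflow$.workflow[[1]]$fit$fit$fit. With those adjustments and your logic, I was able to load it and predict without trouble! - nate-m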
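
The lgb.save / lgb.load variant described in the comment above might look roughly like the sketch below. It is an illustration under assumptions, not the commenter's actual code: it fits a small stand-in model on mtcars instead of the Ames data, writes to temporary files, and uses made-up object names (wf_fit, wf_read, and so on). The booster slot $fit$fit$fit is the one used in the answer for a fitted workflow.

```r
library(tidymodels)
library(bonsai)
library(lightgbm)

# Small stand-in model on mtcars (an assumption for brevity, not the Ames data)
spec <- boost_tree(trees = 20, min_n = 2) %>%
  set_mode("regression") %>%
  set_engine("lightgbm")

wf_fit <- workflow(preprocessor = mpg ~ ., spec = spec) %>%
  fit(data = mtcars)

pred_before <- predict(wf_fit, new_data = mtcars)

# Save the workflow shell and the booster separately
wf_path <- tempfile(fileext = ".rds")
booster_path <- tempfile(fileext = ".txt")
saveRDS(wf_fit, wf_path)
lightgbm::lgb.save(extract_fit_engine(wf_fit), booster_path)

# Read both back and put the booster into the workflow's engine slot
wf_read <- readRDS(wf_path)
wf_read$fit$fit$fit <- lightgbm::lgb.load(booster_path)

pred_after <- predict(wf_read, new_data = mtcars)

# If the round trip worked, the two sets of predictions should agree
all.equal(pred_before, pred_after)
```

For a last_fit() result, the same assignment would target the slot the comment mentions ($.workflow[[1]]$fit$fit$fit) rather than $fit$fit$fit.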

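I found a way to save a lightgbm model for future use. It does not use the tidymodels framework; instead, the fit must first be converted to the native lightgbm model format. The same step is needed if you want to evaluate variable importance.
Building on the code above: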
# Convert to lightgbm booster model
lgb_model <- parsnip::extract_fit_engine(fit_lgbm_workflow) 

# If you want you can now evaluate variable importance. 
# Tidymodels does not support variable importance of lgb via bonsai currently

loss_varimp <- lgb_model %>%
    lgb.importance(.) 

# Save the booster out
lightgbm::lgb.save(lgb_model, filename_x)

# Read the booster in (assign the result so it can be used afterwards)
lgb_model_loaded <- lightgbm::lgb.load(filename_x)

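I haven't figured out whether the loaded lightgbm model can be merged back into the tidymodels format, but at least it can now be used for prediction and evaluation without re-running the model every time. Hope this helps, and please post if you find a cleaner or more up-to-date solution!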
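
Predicting with a reloaded booster directly, outside the workflow, means doing the preprocessing by hand. A minimal sketch under assumptions: a stand-in model on mtcars with only numeric predictors, temporary files, and illustrative object names. The trained recipe is pulled out of the fitted workflow and baked into a numeric matrix, which is what the booster's predict method expects. Factor predictors would need the extra conversion that bonsai performs internally, which this sketch does not cover.

```r
library(tidymodels)
library(bonsai)
library(lightgbm)

# Stand-in recipe and model on mtcars (assumption: all predictors are numeric)
rec <- recipe(mpg ~ ., data = mtcars)

spec <- boost_tree(trees = 20, min_n = 2) %>%
  set_mode("regression") %>%
  set_engine("lightgbm")

wf_fit <- workflow(preprocessor = rec, spec = spec) %>%
  fit(data = mtcars)

# Save the booster in lightgbm's own format and read it back
booster_path <- tempfile(fileext = ".txt")
lightgbm::lgb.save(extract_fit_engine(wf_fit), booster_path)
booster <- lightgbm::lgb.load(booster_path)

# Apply the trained recipe and keep only the predictors, as a numeric matrix
x_new <- extract_recipe(wf_fit) %>%
  bake(new_data = mtcars, all_predictors(), composition = "matrix")

# Raw predictions from the booster: a plain numeric vector, not a tibble
raw_pred <- predict(booster, x_new)
head(raw_pred)
```

Selecting all_predictors() also drops columns with other roles, such as the ID role given to id in the question's recipe.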


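Thanks for sharing this solution. However, with the model saved in lgb format, the data has to be converted before predicting, like this: https://github.com/tidymodels/bonsai/issues/45#:~:text=%3D%20penguins)-,new_data,-%3C%2D%0A%20%20%20%20penguins_subset_numeric%20%25. Even after that conversion I still hit a new error. ([LightGBM] [Fatal] The number of features in data (509) is not the same as it was in training data (488).) - Desmond
Related issues have been filed: https://github.com/tidymodels/bonsai/issues/44 and https://github.com/tidymodels/stacks/issues/145 - Desmond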
