Applying multiple model formulas to groups of data

Question


r dplyr purrr

3

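I want to fit 3 linear models to my data and extract the residuals from each one. Is there a way to combine dplyr and purrr so that the same steps are applied to every model?

I want to keep the following:

The lm object for each model
The augment output for each model
The residuals for each model

Here is a working example that analyzes the mpg dataset: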

library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)

Here are the three different formulas I want to use for lm.

f1 = hwy ~ cyl
f2 = hwy ~ displ
f3 = hwy ~ cyl + displ

lin_mod = function(formula) {
  function(data) {
    lm(formula, data = data)
  }
}

This is how I extract the residuals for a single formula:

mpg %>% 
group_by(manufacturer) %>% 
nest() %>% 
mutate(model = map(data, lin_mod(f1)), 
       aug = map(model, augment), 
       res = map(aug, ".resid"))

However, this approach does not scale to all of the formulas, because I have to repeat a lot of code.

mpg %>% 
group_by(manufacturer) %>% 
nest() %>% 
mutate(model1 = map(data, lin_mod(f1)), 
       aug1 = map(model1, augment), 
       res1 = map(aug1, ".resid"),
       model2 = map(data, lin_mod(f2)), 
       aug2 = map(model2, augment), 
       res2 = map(aug2, ".resid"),
       model3 = map(data, lin_mod(f3)), 
       aug3 = map(model3, augment), 
       res3 = map(aug3, ".resid"))

How can I apply this function to each formula elegantly? I suspect that using mutate_all, or putting the formulas into a list, might help, but I am stuck.

- kmace

3 Answers

0

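I was working on a variant of this problem, applying a series of model formulas to the same data, and came up with a more concise approach using tidyverse 1.3.2.

At a high level, the idea is to map the formulas over the data rather than map the data over the formulas.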

library("tidyverse")

formulas <- list(
  f1 = hwy ~ cyl,
  f2 = hwy ~ displ,
  f3 = hwy ~ cyl + displ
)

Once the formulas have been mapped over the data, we enframe the resulting list of fitted lm models into a tibble.

formulas %>%
  map(
    ~ lm(., mpg)
  ) %>%
  enframe()
#> # A tibble: 3 × 2
#>   name  value 
#>   <chr> <list>
#> 1 f1    <lm>  
#> 2 f2    <lm>  
#> 3 f3    <lm>

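Let's write a function that applies the list of formulas to any data.

# `formulas` is looked up in the enclosing (global) environment.
# `...` absorbs the group key that group_modify() passes as a second argument.
fit_model_to_data <- function(data, ...) {
  formulas %>%
    map(
      ~ lm(., data)
    ) %>%
    enframe(
      name = "formula",
      value = "model"
    )
}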

With the helper function in place, we can group by manufacturer and fit the formulas to each split.

fits <- mpg %>%
  group_by(
    manufacturer
  ) %>%
  group_modify(
    fit_model_to_data
  )
fits
#> # A tibble: 45 × 3
#> # Groups:   manufacturer [15]
#>    manufacturer formula model 
#>    <chr>        <chr>   <list>
#>  1 audi         f1      <lm>  
#>  2 audi         f2      <lm>  
#>  3 audi         f3      <lm>  
#>  4 chevrolet    f1      <lm>  
#>  5 chevrolet    f2      <lm>  
#>  6 chevrolet    f3      <lm>  
#>  7 dodge        f1      <lm>  
#>  8 dodge        f2      <lm>  
#>  9 dodge        f3      <lm>  
#> 10 ford         f1      <lm>  
#> # … with 35 more rows

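At this point the fitted models sit in a convenient data frame, one row per manufacturer and formula. Extracting the residuals is easy.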

fits %>%
  mutate(
    .resid = map(model, residuals)
  )
#> # A tibble: 45 × 4
#> # Groups:   manufacturer [15]
#>    manufacturer formula model  .resid    
#>    <chr>        <chr>   <list> <list>    
#>  1 audi         f1      <lm>   <dbl [18]>
#>  2 audi         f2      <lm>   <dbl [18]>
#>  3 audi         f3      <lm>   <dbl [18]>
#>  4 chevrolet    f1      <lm>   <dbl [19]>
#>  5 chevrolet    f2      <lm>   <dbl [19]>
#>  6 chevrolet    f3      <lm>   <dbl [19]>
#>  7 dodge        f1      <lm>   <dbl [37]>
#>  8 dodge        f2      <lm>   <dbl [37]>
#>  9 dodge        f3      <lm>   <dbl [37]>
#> 10 ford         f1      <lm>   <dbl [25]>
#> # … with 35 more rows
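
The question also asked for the augment output of each model, which the steps above do not produce. A minimal sketch of how it could be added, assuming broom is installed; the column names `aug` and `res` and the object name `results` are illustrative choices, not part of the original answer:

```r
library(dplyr)
library(purrr)
library(tibble)
library(broom)

formulas <- list(
  f1 = hwy ~ cyl,
  f2 = hwy ~ displ,
  f3 = hwy ~ cyl + displ
)

# Fit every formula to one data frame; `...` absorbs the group key
# that group_modify() passes as a second argument.
fit_model_to_data <- function(data, ...) {
  formulas %>%
    map(~ lm(.x, data = data)) %>%
    enframe(name = "formula", value = "model")
}

results <- ggplot2::mpg %>%
  group_by(manufacturer) %>%
  group_modify(fit_model_to_data) %>%
  ungroup() %>%
  mutate(
    aug = map(model, augment),  # one tibble of per-observation diagnostics per model
    res = map(aug, ".resid")    # pull the .resid column out of each augment tibble
  )
```

Each row of `results` then carries the lm object, its augment tibble, and its residual vector for one manufacturer and formula.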


packageVersion("tidyverse")
#> [1] '1.3.2'

Created on 2022-09-06 with reprex v2.0.2.

- dipetkov

0

This is the closest I could get by following the example here.

library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(ggplot2)

# Here are the three different formulas I want to use for my lm

f1 = hwy ~ cyl
f2 = hwy ~ displ
f3 = hwy ~ cyl + displ

formulas = c(f1,f2,f3)

lin_mod = function(formula) {
  function(data) {
    lm(formula, data = data)
  }
}

list_model = lapply(formulas, lin_mod)
names(list_model) = c('cyl_model', 'displ_model', 'full_model')


fn_model <- function(.model, df){
  df$model <- map(df$data, possibly(.model, NULL))
  df
}

mpg_nested = mpg %>% 
group_by(manufacturer) %>% 
nest()

mpg_nested_new = list_model %>% 
                 map_df(fn_model, mpg_nested, .id = 'id_model') %>% 
                 arrange(manufacturer) %>% 
                 mutate(aug = map(model, augment), 
                 res = map(aug, ".resid"))


output = mpg_nested_new %>% 
gather(Var, val, c('model', 'aug', 'res')) %>% 
unite(desc, id_model, Var)%>% 
spread(desc, val)
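
gather() and spread() are superseded in tidyr 1.0.0 and later. As a rough sketch of the same reshaping with pivot_wider(), assuming tidyr >= 1.0.0: the object names `long` and `wide` are illustrative, and the possibly() wrapper from the code above is left out for brevity.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

formulas <- list(
  cyl_model   = hwy ~ cyl,
  displ_model = hwy ~ displ,
  full_model  = hwy ~ cyl + displ
)

# One row per manufacturer, with that manufacturer's rows in a list column.
mpg_nested <- ggplot2::mpg %>%
  nest(data = -manufacturer)

# Long form: one row per (formula, manufacturer) pair.
long <- formulas %>%
  map(function(f) {
    mutate(mpg_nested, model = map(data, function(d) lm(f, data = d)))
  }) %>%
  bind_rows(.id = "id_model") %>%
  mutate(
    aug = map(model, augment),
    res = map(aug, ".resid")
  )

# Wide form: one row per manufacturer, one list column per formula and output.
wide <- long %>%
  pivot_wider(
    id_cols     = manufacturer,
    names_from  = id_model,
    values_from = c(model, aug, res),
    names_glue  = "{id_model}_{.value}"
  )
```

The result has columns such as `cyl_model_model`, `cyl_model_aug`, and `cyl_model_res`, which follows the naming that unite() produced in the original code.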

- kmace


- dmi3kno · Accepted Answer

You can use mutate_at (or mutate_if) to mutate the list columns in place. This saves several iterations and makes the code pipeable and more compact.

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

lin_mod = function(formula) {
  function(data,...){
  map(data,~lm(formula, data = .x))
  }
}

list_model <- list(cyl_model= hwy ~ cyl,
                   displ_model= hwy ~ displ,
                   full_model= hwy ~ cyl + displ) %>% 
              lapply(lin_mod)

ggplot2::mpg %>% 
  group_by(manufacturer) %>% nest() %>% 
    mutate_at(.vars=("data"),.funs=list_model) %>% 
    mutate_at(.vars=vars(ends_with("model")), .funs=~map(.x, augment)) %>% 
    mutate_at(.vars=vars(ends_with("model")), .funs=~map(.x, ".resid")) %>% unnest()
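
In dplyr 1.0.0 and later, mutate_at() is superseded by across(), and tidyr 1.0.0 and later expects unnest() to be told which columns to unnest. The following is a hedged sketch of the same pipeline in that style, not part of the original answer. The result name `resid_by_model` is illustrative, and dropping the `data` column before unnesting is an assumption about the desired output.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Factory: given a formula, return a function that fits it to every
# data frame in a list column.
lin_mod <- function(formula) {
  function(data, ...) {
    map(data, ~ lm(formula, data = .x))
  }
}

list_model <- list(
  cyl_model   = hwy ~ cyl,
  displ_model = hwy ~ displ,
  full_model  = hwy ~ cyl + displ
) %>%
  lapply(lin_mod)

resid_by_model <- ggplot2::mpg %>%
  nest(data = -manufacturer) %>%
  # One new list column of lm objects per formula, named after the formula.
  mutate(across(data, list_model, .names = "{.fn}")) %>%
  # Replace each model with its augment output, then with its residuals.
  mutate(across(ends_with("model"), ~ map(.x, augment))) %>%
  mutate(across(ends_with("model"), ~ map(.x, ".resid"))) %>%
  select(-data) %>%
  unnest(cols = ends_with("model"))
```

As in the original, each `*_model` column is overwritten in place, so only the residuals survive. To keep the lm objects and the augment output as well, give the later steps their own names through the `.names` argument of across().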