Run a custom function on a data frame by group

Question

19

I'm having trouble getting a custom function to run over each group in a data frame.

Here is some sample data:

set.seed(42)
tm <- as.numeric(c("1", "2", "3", "3", "2", "1", "2", "3", "1", "1"))
d <- as.numeric(sample(0:2, size = 10, replace = TRUE))
t <- as.numeric(sample(0:2, size = 10, replace = TRUE))
h <- as.numeric(sample(0:2, size = 10, replace = TRUE))

df <- as.data.frame(cbind(tm, d, t, h))
df$p <- rowSums(df[2:4])

I created a custom function to calculate a value w:

calc <- function(x) {
  data <- x
  w <- (1.27*sum(data$d) + 1.62*sum(data$t) + 2.10*sum(data$h)) / sum(data$p)
  w
  }

When I run the function on the whole data set, I get the following answer:

calc(df)
[1] 1.664474

Ideally, I'd like the results returned per group of tm, e.g.:

tm     w
1    result of calc
2    result of calc
3    result of calc

So far I have tried using aggregate with my function, but I get the following error:

aggregate(df, by = list(tm), FUN = calc)
Error in data$d : $ operator is invalid for atomic vectors

I feel like I've stared at this for too long and there's an obvious answer.

- BillPetti

6 Answers

14

Using dplyr

library(dplyr)
df %>% 
   group_by(tm) %>%
   do(data.frame(val=calc(.)))
#  tm      val
#1  1 1.665882
#2  2 1.504545
#3  3 1.838000

If we change the function slightly so that it takes the columns as separate arguments, it also works with summarise.

 calc1 <- function(d1, t1, h1, p1){
      (1.27*sum(d1) + 1.62*sum(t1) + 2.10*sum(h1) )/sum(p1) }
 df %>%
     group_by(tm) %>% 
     summarise(val=calc1(d, t, h, p))
 #  tm      val
 #1  1 1.665882
 #2  2 1.504545
 #3  3 1.838000

- akrun

5

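Since dplyr 0.8 you can use group_map (note that from dplyr 0.8.1 onward group_map returns a list, and group_modify is the function that returns the grouped tibble shown here):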

library(dplyr)
df %>% group_by(tm) %>% group_map(~tibble(w=calc(.)))
#> # A tibble: 3 x 2
#> # Groups:   tm [3]
#>      tm     w
#>   <dbl> <dbl>
#> 1     1  1.67
#> 2     2  1.50
#> 3     3  1.84
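
With current dplyr releases the same per-group call can be written with group_modify. This is only a minimal sketch, assuming dplyr 0.8.1 or later; the data frame is a small made-up stand-in for the question's random sample, and calc is the question's function with the redundant copy removed.

```r
library(dplyr)

# Illustrative data (hypothetical values, not the question's random sample)
df <- data.frame(
  tm = c(1, 2, 3, 3, 2, 1),
  d  = c(1, 0, 2, 1, 1, 0),
  t  = c(0, 2, 1, 1, 0, 2),
  h  = c(2, 1, 0, 1, 2, 1)
)
df$p <- rowSums(df[c("d", "t", "h")])

calc <- function(x) {
  (1.27 * sum(x$d) + 1.62 * sum(x$t) + 2.10 * sum(x$h)) / sum(x$p)
}

# group_modify passes each group's rows (without the grouping column) as .x
# and expects a data frame back, so the scalar is wrapped in a tibble
res <- df %>%
  group_by(tm) %>%
  group_modify(~ tibble(w = calc(.x))) %>%
  ungroup()

res
```

The result has one row per value of tm, with the grouping column kept alongside w.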

- moodymudskipper

4

library(plyr)
ddply(df, .(tm), calc)

- MrGumble

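This is exactly what I originally wanted, but I was trying to do it in dplyr. Do you know the equivalent? - BillPetti

Great follow-up question. I hadn't considered that dplyr would replace ddply (and friends). I'm looking for an answer now... - MrGumble

The closest I can come up with is: group_by(df, tm) %>% do(as.data.frame(calc(.))), but the added as.data.frame isn't pretty. - MrGumble

As a follow-up: the function passed to do needs to return a data frame, not a scalar. As long as calc returns a data frame, you're safe. - MrGumble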
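
The point made in the comments above can be sketched as follows. The wrapper name calc_df is hypothetical (it does not appear in the thread), and the data are made-up values rather than the question's sample; the idea is only that do accepts the function once it returns a data frame.

```r
library(dplyr)

# Illustrative data (hypothetical values, not the question's random sample)
df <- data.frame(
  tm = c(1, 2, 3, 3, 2, 1),
  d  = c(1, 0, 2, 1, 1, 0),
  t  = c(0, 2, 1, 1, 0, 2),
  h  = c(2, 1, 0, 1, 2, 1)
)
df$p <- rowSums(df[c("d", "t", "h")])

calc <- function(x) {
  (1.27 * sum(x$d) + 1.62 * sum(x$t) + 2.10 * sum(x$h)) / sum(x$p)
}

# Hypothetical wrapper: same calculation, but returned as a one-row data frame
calc_df <- function(x) data.frame(w = calc(x))

# do() calls calc_df once per group and binds the one-row results together
res <- df %>%
  group_by(tm) %>%
  do(calc_df(.)) %>%
  ungroup()

res
```

Keeping calc itself unchanged and wrapping it means the scalar version stays usable with sapply or map_dbl.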

0

...and a solution using the map functions...

library(purrr)
df %>% 
    split(.$tm) %>% 
    map_dbl(calc)
# 1        2        3 
# 1.665882 1.504545 1.838000

- Richard

0

Here is a neat solution that also fits the tidy workflow, illustrated with the palmerpenguins data set and a linear regression model:

library(dplyr)
library(tidyr)
library(purrr)

palmerpenguins::penguins |>
  drop_na() |>
  group_by(species) |>
  nest() |>
  mutate(
    # fit one model per species and tidy its coefficients
    test_results = map(
      .x = data,
      .f = ~ lm(body_mass_g ~ flipper_length_mm, data = .x) |>
        broom::tidy(conf.int = TRUE)
    )
  ) |>
  unnest(test_results) |>
  select(species, term, estimate, p.value, conf.low, conf.high) |>
  filter(term != "(Intercept)") |>
  ungroup()

- Simen Løkken

- Colonel Beauvel · Accepted Answer

You can try split:

sapply(split(df, tm), calc)

#       1        2        3 
#1.665882 1.504545 1.838000

If you want a list instead, use lapply(split(df, tm), calc).
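
For context, aggregate fails in the question because it hands each column of each group to FUN as a plain vector, so data$d has nothing to index; split passes whole sub-data-frames instead. To get the two-column tm / w layout the question asks for, the split result can be reshaped in base R. A minimal sketch, assuming made-up data in place of the random sample, and splitting on df$tm so it does not rely on a separate tm vector in the workspace:

```r
# Illustrative data (hypothetical values, not the question's random sample)
df <- data.frame(
  tm = c(1, 2, 3, 3, 2, 1),
  d  = c(1, 0, 2, 1, 1, 0),
  t  = c(0, 2, 1, 1, 0, 2),
  h  = c(2, 1, 0, 1, 2, 1)
)
df$p <- rowSums(df[c("d", "t", "h")])

calc <- function(x) {
  (1.27 * sum(x$d) + 1.62 * sum(x$t) + 2.10 * sum(x$h)) / sum(x$p)
}

# One call of calc per group; vapply checks that each result is a single number
w <- vapply(split(df, df$tm), calc, numeric(1))

# The group labels come back as names, so turn them into a column
res <- data.frame(tm = as.numeric(names(w)), w = unname(w))
res
```

vapply is used here rather than sapply only because it fails loudly if calc ever returns something other than one number per group.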

Or you can use data.table:

library(data.table)

setDT(df)[,calc(.SD),tm]
#   tm       V1
#1:  1 1.665882
#2:  2 1.504545
#3:  3 1.838000