How can I use dplyr's summarise and summarise_each functions together?

Question

I would like to apply dplyr::summarise and dplyr::summarise_each at the same time to a grouped data frame. Is that possible?

My data looks like this:

mydf <- data.frame(
    id = c(rep(1,2), rep(2, 3), rep(3, 4)), 
    amount = c(rep(1,4), rep(2,5)), 
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)
mydf
#  id amount type1 type2
#1  1      1     1     0
#2  1      1     1     0
#3  2      1     0     0
#4  2      1     0     0
#5  2      2     0     1
#6  3      2     0     1
#7  3      2     0     1
#8  3      2     0     1
#9  3      2     0     1

I would like to sum the amount variable by id and take the maximum of each type variable, which can be done like this:

library(dplyr)

mydf %>% 
    group_by(id) %>% 
    summarise(amount = sum(amount), type1 = max(type1), type2 = max(type2))

However, I have many type variables, so I would prefer something like this (but also including the sum of amount):

mydf %>%
    group_by(id) %>%
    summarise_each(funs(max), matches("type"))

- janosdivenyi

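Interesting question. I wonder whether you would accept a non-dplyr solution. - David Arenburg

Perhaps dplyr does not allow this. In that case I should look for a non-dplyr solution. - janosdivenyi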

Maybe unique(mydf %>% group_by(id) %>% mutate(amount = sum(amount)) %>% mutate_each(funs(max), matches("type")))? - Veerendra Gadekar

@VeerendraGadekar That is a nice workaround, thanks. If you post it as an answer, I can accept it. - janosdivenyi

@VeerendraGadekar To keep everything in the pipe:

mydf %>% group_by(id) %>% mutate(amount = sum(amount)) %>% mutate_each(funs(max), matches("type")) %>% unique

- Carlos Cinelli

3 Answers

I'm not sure about the idiomatic way to do this with dplyr, but this is fairly idiomatic with data.table.

library(data.table)
setDT(mydf)[, c(amount = sum(amount), 
                lapply(.SD[, grep("type", names(mydf), value = TRUE), with = FALSE], max)),
            by = id]
#    id amount type1 type2
# 1:  1      2     1     0
# 2:  2      4     0     1
# 3:  3      8     0     1

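Basically, we combine the two operations using c, where lapply(.SD, max) plays the role of mutate_each in dplyr and matches is just a wrapper around grep (as the source code clearly shows). with = FALSE is for standard evaluation of the column names within the data.table scope or the .SD parent frame (.SD holds the subset of data for the current group).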
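The same combination can also be sketched with the .SDcols argument, so that .SD itself needs no subsetting. This is only a variation under stated assumptions: dt, type_cols and dt_result are names introduced here for illustration, and as.data.table() is used instead of setDT() so that mydf is not modified by reference.

```r
library(data.table)

mydf <- data.frame(
    id = c(rep(1, 2), rep(2, 3), rep(3, 4)),
    amount = c(rep(1, 4), rep(2, 5)),
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)

# as.data.table() copies, so mydf itself stays a plain data.frame
dt <- as.data.table(mydf)

# type_cols is a helper name introduced for this sketch
type_cols <- grep("type", names(dt), value = TRUE)

# .SDcols restricts .SD to the type columns; amount is still visible in j
dt_result <- dt[, c(list(amount = sum(amount)), lapply(.SD, max)),
                by = id, .SDcols = type_cols]
dt_result
```

Wrapping sum(amount) in list() keeps both pieces passed to c() as lists, which makes it explicit that j returns one column per element.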

- David Arenburg

A more general approach using dplyr could be:

mydf %>%
  group_by(id) %>%
  mutate_each('sum', amount) %>%
  mutate_each('max', matches("type")) %>%
  summarise_each('first', amount, matches("type"))

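This has the advantage of applying only one aggregation function to each column, just like Veerendra Gadekar's original answer. That comes in handy if we need sd or something similar instead of max, a case in which Hong Ooi's solution would give the wrong result. A further advantage is that it drops columns that take no part in the computation.

See also my related question.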
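To make the sd case concrete, here is a small sketch of both variants. It assumes dplyr 1.0.0 or later, where across() is available, and it uses across() in place of the older mutate_each and summarise_each calls. The names broadcast_sd and per_column_sd are introduced here for illustration only.

```r
library(dplyr)

mydf <- data.frame(
    id = c(rep(1, 2), rep(2, 3), rep(3, 4)),
    amount = c(rep(1, 4), rep(2, 5)),
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)

# Shortcut style: broadcast the sum, then apply ONE function to every column.
# amount is constant within each group after mutate(), so its sd is 0.
broadcast_sd <- mydf %>%
    group_by(id) %>%
    mutate(amount = sum(amount)) %>%
    summarise(across(everything(), sd))

# Per-column style: each column gets its own function, then first() collapses
# each group to a single row.
per_column_sd <- mydf %>%
    group_by(id) %>%
    mutate(across(amount, sum)) %>%
    mutate(across(matches("type"), sd)) %>%
    summarise(across(c(amount, matches("type")), first))

broadcast_sd$amount   # the group sums are lost
per_column_sd$amount  # the group sums are kept
```

With max the shortcut happens to work, because the maximum of a constant column is that constant. With sd it silently replaces the group sum.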

- arekolek

- Veerendra Gadekar · Accepted Answer

Using dplyr

library(dplyr)

mydf %>% 
     group_by(id) %>% 
     mutate(amount = sum(amount)) %>% 
     mutate_each(funs(max), matches("type")) %>%
     unique

#Source: local data table [3 x 4]

#  id amount type1 type2
#1  1      2     1     0
#2  2      4     0     1
#3  3      8     0     1

Or simply, as @HongOoi pointed out:

mydf %>% 
     group_by(id) %>% 
     mutate(amount=sum(amount)) %>% 
     summarise_each(funs(max))
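
In later dplyr releases, summarise_each, mutate_each and funs are deprecated. As a rough sketch, assuming dplyr 1.0.0 or later, the same result can be obtained in a single summarise call by mixing a named summary with across(). The name across_result is introduced here for illustration.

```r
library(dplyr)

mydf <- data.frame(
    id = c(rep(1, 2), rep(2, 3), rep(3, 4)),
    amount = c(rep(1, 4), rep(2, 5)),
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)

# One summarise() call: a named summary for amount, plus across() to apply
# max to every column whose name matches "type"
across_result <- mydf %>%
    group_by(id) %>%
    summarise(amount = sum(amount), across(matches("type"), max))

across_result
```

This avoids both the intermediate mutate step and the call to unique, since summarise already returns one row per group.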