Returning multiple values from a single function with dplyr summarise()

Question

47

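I would like to know whether there is a way, in dplyr 0.1.2, to use a function that returns multiple values (for example, describe from the psych package) inside summarise(). If not, is that because it has not been implemented yet, or because it is not a good idea?

Example: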

require(psych)
require(ggplot2)
require(dplyr)

dgrp <- group_by(diamonds, cut)
describe(dgrp$price)
summarise(dgrp, describe(price))

This produces the error: expecting a single value

- jzadra

3 Answers

12

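This is possible in recent versions of the tidyverse. First, in the example you provide, the function returns a one-row data frame. If we use such a function inside summarize(), it produces a data-frame column, which we can turn into separate columns with unpack().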

library(tidyverse)
library(psych)

describe(diamonds$price)
#>    vars     n   mean      sd median trimmed     mad min   max range skew
#> X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
#>    kurtosis    se
#> X1     2.18 17.18

diamonds %>%
  group_by(cut) %>%
  summarize(descr = describe(price)) %>%
  unpack(cols = descr)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 14
#>   cut    vars     n  mean    sd median trimmed   mad   min   max range  skew
#>   <ord> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair      1  1610 4359. 3560.  3282    3696. 2183.   337 18574 18237  1.78
#> 2 Good      1  4906 3929. 3682.  3050.   3252. 2853.   327 18788 18461  1.72
#> 3 Very…     1 12082 3982. 3936.  2648    3243. 2855.   336 18818 18482  1.60
#> 4 Prem…     1 13791 4584. 4349.  3185    3822. 3371.   326 18823 18497  1.33
#> 5 Ideal     1 21551 3458. 3808.  1810    2656. 1631.   326 18806 18480  1.84
#> # … with 2 more variables: kurtosis <dbl>, se <dbl>

Second, in some cases the function simply returns a vector. In those cases, summarize() produces one new row for each value returned.

set.seed(1234)
dsmall <- diamonds[sample(nrow(diamonds), 25), ]

unique(dsmall$clarity)
#> [1] I1   SI2  VVS2 VS1  VVS1 VS2  SI1  IF  
#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

dsmall %>%
  group_by(cut) %>%
  summarize(clarity = unique(clarity))
#> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
#> # A tibble: 17 x 2
#> # Groups:   cut [4]
#>    cut       clarity
#>    <ord>     <ord>  
#>  1 Good      I1     
#>  2 Good      SI2    
#>  3 Good      VS1    
#>  4 Good      SI1    
#>  5 Very Good VVS2   
#>  6 Very Good SI2    
#>  7 Very Good VS1    
#>  8 Very Good IF     
#>  9 Premium   SI2    
#> 10 Premium   SI1    
#> 11 Ideal     VS1    
#> 12 Ideal     VVS1   
#> 13 Ideal     VS2    
#> 14 Ideal     VVS2   
#> 15 Ideal     SI1    
#> 16 Ideal     SI2    
#> 17 Ideal     IF

Created on 2020-07-14 by the reprex package (v0.3.0)
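For readers on dplyr 1.1.0 or later: returning several rows per group from summarise() was deprecated in that release in favour of reframe(). The sketch below shows the same multi-row pattern with reframe(); it assumes dplyr >= 1.1.0 and uses the built-in iris data and a hypothetical column name `width` purely for illustration.

```r
library(dplyr)

# Assumption: dplyr >= 1.1.0, where reframe() is available.
# Each group returns two values, so the result has two rows per group.
iris_quartiles <- iris %>%
  group_by(Species) %>%
  reframe(
    quartile = c("q25", "q75"),
    width = unname(quantile(Petal.Width, c(0.25, 0.75)))
  )

iris_quartiles
```

Unlike summarise(), reframe() always returns an ungrouped result, so no `.groups` message appears.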

- Claus Wilke

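Yes, I saw that. It is a great enhancement to dplyr! - jzadra

@ClausWilke, does this still work? It gives me "Error: Problem with summarise() input descr. x Input must be a vector, not a describe object." - YBS

You must make sure you are using tidyr::unpack rather than Matrix::unpack for this to work. - RobS

https://dplyr.tidyverse.org/reference/summarise.html Available since dplyr 1.0.0. - qwr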

0

A simpler option is to stay within dplyr and have the function return its results as a tibble.

For example:

meanc <- function(x){tibble(xmean=mean(x),xsd=sd(x))}

db_sum <- iris %>% group_by(Species) %>% summarize(meanc(Petal.Width))
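As a hedged illustration of why this works: in dplyr >= 1.0.0, an unnamed summarise() expression that returns a data frame is spread into separate columns, while a named one is kept as a single data-frame column that tidyr::unpack() can expand afterwards. The names `unnamed`, `named`, `unpacked`, and `stats` below are only for illustration.

```r
library(dplyr)
library(tidyr)

# Same helper as above: returns a one-row tibble with two summary columns
meanc <- function(x) tibble(xmean = mean(x), xsd = sd(x))

# Unnamed: the tibble's columns become columns of the result
unnamed <- iris %>%
  group_by(Species) %>%
  summarise(meanc(Petal.Width))

# Named: the result holds one data-frame column called `stats`
named <- iris %>%
  group_by(Species) %>%
  summarise(stats = meanc(Petal.Width))

# unpack() expands the data-frame column into regular columns
unpacked <- unpack(named, cols = stats)

names(unnamed)
names(named)
names(unpacked)
```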

- Luc

- Artem Klevtsov · Accepted Answer

With dplyr >= 0.2, we can use the do function for this:

library(ggplot2)
library(psych)
library(dplyr)
diamonds %>%
    group_by(cut) %>%
    do(describe(.$price)) %>%
    select(-vars)
#> Source: local data frame [5 x 13]
#> Groups: cut [5]
#> 
#>         cut     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
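Note that do() has been superseded since dplyr 1.0.0. One possible replacement, sketched here under the assumption of dplyr >= 0.8.0, is group_modify(), which applies a function to each group's data and expects a data frame back. The example uses the built-in iris data and a hand-written summary rather than psych::describe so that it needs no extra packages.

```r
library(dplyr)

# group_modify() calls the function once per group; .x is the group's rows
# (without the grouping column) and the result must be a data frame.
by_species <- iris %>%
  group_by(Species) %>%
  group_modify(~ tibble(
    n = nrow(.x),
    mean = mean(.x$Petal.Width),
    sd = sd(.x$Petal.Width)
  ))

by_species
```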

A solution based on the purrr package (since 2017, these functions live in the purrrlyr package):

library(ggplot2)
library(psych)
library(purrr)  # with current purrr, load purrrlyr instead for slice_rows() and by_slice()
diamonds %>% 
    slice_rows("cut") %>% 
    by_slice(~ describe(.x$price), .collate = "rows")
#> Source: local data frame [5 x 14]
#> 
#>         cut  vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
#>      (fctr) (dbl) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
#> 1      Fair     1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
#> 2      Good     1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
#> 3 Very Good     1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
#> 4   Premium     1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
#> 5     Ideal     1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233

But with data.table it is very simple:

library(data.table)
as.data.table(diamonds)[, describe(price), by = cut]
#>          cut vars     n     mean       sd median  trimmed      mad min   max range     skew kurtosis       se
#> 1:     Ideal    1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
#> 2:   Premium    1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
#> 3:      Good    1  4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
#> 4: Very Good    1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
#> 5:      Fair    1  1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281

We can also write our own summary function that returns a list:

fun <- function(x) {
    list(n = length(x),
         min = min(x),
         median = as.numeric(median(x)),
         mean = mean(x),
         sd = sd(x),
         max = max(x))
}
as.data.table(diamonds)[, fun(price), by = cut]
#>          cut     n min median     mean       sd   max
#> 1:     Ideal 21551 326 1810.0 3457.542 3808.401 18806
#> 2:   Premium 13791 326 3185.0 4584.258 4349.205 18823
#> 3:      Good  4906 327 3050.5 3928.864 3681.590 18788
#> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
#> 5:      Fair  1610 337 3282.0 4358.758 3560.387 18574
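A list-returning function like this one can also be reused from dplyr. The following is a minimal sketch, assuming dplyr >= 1.0.0: the named list is converted to a one-row tibble with as_tibble(), and the unnamed summarise() expression then spreads it into columns. The shortened `fun` and the mtcars data are stand-ins chosen so the example runs without ggplot2.

```r
library(dplyr)

# Shortened version of the list-returning summary function above
fun <- function(x) {
  list(n = length(x),
       min = min(x),
       mean = mean(x),
       max = max(x))
}

# as_tibble() turns the named list of length-one elements into a one-row tibble
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(as_tibble(fun(mpg)))

mpg_by_cyl
```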