How to use standard evaluation in dplyr's summarise_ function

3

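I have looked in several places, but I cannot work out how to do this. The approach also seems to have changed several times, which makes it more confusing.

I want a function that summarises NumbOfBx by Endoscopist. I have the following data frame: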

vv <- structure(list(Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", 
"John Boy ", "John Boy ", "John Boy ", "Mar Gret ", "John Boy ", 
"Mar Gret ", "Phil Ip ", "Phil Ip "), NumbOfBx = c(2, 4, NA, 
2, 12, 12, NA, NA, NA, 3, NA)), row.names = 100:110, .Names = c("Endoscopist", 
"NumbOfBx"), class = "data.frame")

My function is:

NumBx <- function(x, y, z) {
  x <- data.frame(x)
  x <- x[!is.na(x[,y]), ]
  NumBxPlot <- x %>% group_by_(z) %>% summarise(avg = mean(y, na.rm = T))
}

I call it like this:

NumBx(vv, "Endoscopist", "NumbOfBx")

This gives me the following warnings:

Warning messages:
1: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
2: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
3: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA

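I changed the function to use summarise_, but the result was the same. I then realised that summarise_ in particular (not just group_by_) needs standard evaluation, so I tried this, based on a Stack Overflow example: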

library(lazyeval)
NumBx <- function(x, y, z) {
  x <- data.frame(x)
  x <- x[!is.na(x[,y]), ]
  NumBxPlot <- x %>% group_by_(z) %>% 
      summarise_(sum_val = interp(~mean(y, na.rm = TRUE), var = as.name(y)))
}

But I still get the same warnings:

Warning messages:
1: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
2: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA
3: In mean.default(y, na.rm = T) :
  argument is not numeric or logical: returning NA

My desired output is:

Endoscopist   Avg
Jupi Ter       4
John Boy       28
Phil Ip        3

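Try using get(y) inside your summarise call. I tested it and got the same error when using a variable to refer to a column name; get() fixed it. Your group_by probably needs the same treatment. - Balter

2 Answers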

2

With rlang (instead of lazyeval), you can do it like this:

library(dplyr)

vv <- structure(list(Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ", "John Boy ", "John Boy ", "Mar Gret ", "John Boy ", "Mar Gret ", "Phil Ip ", "Phil Ip "), 
                     NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA)), 
                row.names = 100:110, .Names = c("Endoscopist", "NumbOfBx"), class = "data.frame")

num_bx <- function(.data, group, variable) {
    group <- enquo(group)
    variable <- enquo(variable)

    .data %>% 
        tidyr::drop_na(!!variable) %>% 
        group_by(!!group) %>% 
        summarise(avg = mean(!!variable))
}

vv %>% num_bx(Endoscopist, NumbOfBx)
#> # A tibble: 3 x 2
#>   Endoscopist   avg
#>         <chr> <dbl>
#> 1   John Boy      7
#> 2   Jupi Ter      4
#> 3    Phil Ip      3

If you would rather pass them as strings than as unquoted names:

num_bx <- function(.data, group, variable) {
    group <- rlang::sym(group)
    variable <- rlang::sym(variable)

    .data %>% 
        tidyr::drop_na(!!variable) %>% 
        group_by(!!group) %>% 
        summarise(avg = mean(!!variable))
}

vv %>% num_bx("Endoscopist", "NumbOfBx")
#> # A tibble: 3 x 2
#>   Endoscopist   avg
#>         <chr> <dbl>
#> 1   John Boy      7
#> 2   Jupi Ter      4
#> 3    Phil Ip      3
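
As a possible alternative for the string case, recent dplyr versions also provide the `.data` pronoun and `all_of()`, which look a column up by its name without converting the string to a symbol first. The following is only a sketch: it assumes dplyr 1.0 or later, and the name `num_bx_str` is made up for illustration.

```r
library(dplyr)

# Same data as in the question, rebuilt with data.frame() for brevity
vv <- data.frame(
  Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ",
                  "John Boy ", "John Boy ", "Mar Gret ", "John Boy ",
                  "Mar Gret ", "Phil Ip ", "Phil Ip "),
  NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA)
)

# Hypothetical helper: `group` and `variable` are column names given as strings
num_bx_str <- function(df, group, variable) {
  df %>%
    # .data[[variable]] fetches the column whose name is stored in `variable`
    filter(!is.na(.data[[variable]])) %>%
    # all_of() selects the grouping column by name and keeps that name
    group_by(across(all_of(group))) %>%
    summarise(avg = mean(.data[[variable]]))
}

res <- num_bx_str(vv, "Endoscopist", "NumbOfBx")
res
```

With the data above this should return the same three groups as the `sym()` version, with averages 7, 4 and 3.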

@aosmith Oh, yes, the old package is called lazyeval. "tidy eval" is just a concept, not a package. Corrected. - alistaire

1

Following the dplyr programming guide, define your function like this:

NumBx <- function( x, y, z )
{
    yy <- enquo( y )
    zz <- enquo( z )

    data.frame(x) %>% filter( !is.na(!!yy) ) %>% group_by( !!zz ) %>%
        summarize( avg = mean(!!yy) )
}

Now you can call it like this:

NumBx( vv, NumbOfBx, Endoscopist )
#   Endoscopist   avg
#         <chr> <dbl>
# 1   John Boy      7
# 2   Jupi Ter      4
# 3    Phil Ip      3

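A few notes:

  1. The arguments in your call appear to be in the wrong order. You want to group by z, but you pass NumbOfBx as the z argument.
  2. na.rm = TRUE is redundant. You have already filtered out the rows where the y variable is NA.
  3. The mean for John Boy should be 7, not the 28 in your intended output.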
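
For what it is worth, the `enquo()` / `!!` pair can also be written with the embrace operator `{{ }}`, which quotes and unquotes an argument in one step. This is a sketch rather than part of the original answer: it assumes rlang 0.4.0 or later (as shipped alongside current dplyr), and `num_bx_embrace` is a hypothetical name. It keeps the argument order of `NumBx` above (value column first, grouping column second).

```r
library(dplyr)

# Same data as in the question, rebuilt with data.frame() for brevity
vv <- data.frame(
  Endoscopist = c("John Boy ", "Jupi Ter ", "Jupi Ter ", "John Boy ",
                  "John Boy ", "John Boy ", "Mar Gret ", "John Boy ",
                  "Mar Gret ", "Phil Ip ", "Phil Ip "),
  NumbOfBx = c(2, 4, NA, 2, 12, 12, NA, NA, NA, 3, NA)
)

# Hypothetical helper: `variable` and `group` are passed as unquoted column names
num_bx_embrace <- function(df, variable, group) {
  df %>%
    # {{ variable }} forwards the caller's column name into the dplyr verb
    filter(!is.na({{ variable }})) %>%
    group_by({{ group }}) %>%
    # no na.rm needed: rows with NA in `variable` were dropped above
    summarise(avg = mean({{ variable }}))
}

res <- num_bx_embrace(vv, NumbOfBx, Endoscopist)
res
```

The result should match the output shown above: one row per endoscopist with at least one non-missing value, with averages 7, 4 and 3.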
