Dplyr standard evaluation with a vector of multiple strings in mutate()

3

I am trying to supply a vector containing multiple column names to a mutate() call using the dplyr package. Reproducible example below:

stackdf <- data.frame(jack = c(1,NA,2,NA,3,NA,4,NA,5,NA),
                      jill = c(1,2,NA,3,4,NA,5,6,NA,7),
                      jane = c(1,2,3,4,5,6,NA,NA,NA,NA))
two_names <- c('jack','jill')
one_name <- c('jack')

#   jack jill jane
#    1    1    1
#   NA    2    2
#    2   NA    3
#   NA    3    4
#    3    4    5
#   NA   NA    6
#    4    5   NA
#   NA    6   NA
#    5   NA   NA
#   NA    7   NA

I can figure out how to use the "one variable" version, but I do not know how to extend it to multiple variables.

# the below works as expected, and is an example of the output I desire
stackdf %>% rowwise %>% mutate(test = anyNA(c(jack,jill)))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3  TRUE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA  TRUE
10    NA     7    NA  TRUE


# using the one_name variable works if I evaluate it and then convert to 
# a name before unquoting it
stackdf %>% rowwise %>% mutate(test = anyNA(!!as.name(eval(one_name))))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3 FALSE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA FALSE
10    NA     7    NA  TRUE

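How can I extend the above approach so that I can use the two_names vector? as.name on its own accepts only a single object, so it does not work.
This question is similar: Pass a vector of variable names to arrange() in dplyr. That solution "works" in the sense that I can use the code below: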
two_names2 <- quos(c(jack, jill))
stackdf %>% rowwise %>% mutate(test = anyNA(!!!two_names2))

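But having to type c(jack, jill) directly rather than using the two_names variable defeats the purpose. Is there a similar procedure that lets me use two_names directly? This answer, How to pass a named vector to dplyr::select using quosures?, uses rlang::syms, but although that works for selecting variables (i.e. stackdf %>% select(!!! rlang::syms(two_names))), it does not seem to work for supplying arguments when mutating (i.e. stackdf %>% rowwise %>% mutate(test = anyNA(!!! rlang::syms(two_names)))). This answer is similar but does not work: How to evaluate a constructed string with non-standard evaluation using dplyr?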

2 Answers

7
You can use rlang::syms (re-exported by dplyr; alternatively, call it directly) to convert the strings to symbols, then splice them into c() with !!!:
library(dplyr)

stackdf <- data.frame(jack = c(1,NA,2,NA,3,NA,4,NA,5,NA),
                      jill = c(1,2,NA,3,4,NA,5,6,NA,7),
                      jane = c(1,2,3,4,5,6,NA,NA,NA,NA))
two_names <- c('jack','jill')

stackdf %>% rowwise %>% mutate(test = anyNA(c(!!!syms(two_names))))
#> Source: local data frame [10 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 10 x 4
#>     jack  jill  jane test 
#>    <dbl> <dbl> <dbl> <lgl>
#>  1    1.    1.    1. FALSE
#>  2   NA     2.    2. TRUE 
#>  3    2.   NA     3. TRUE 
#>  4   NA     3.    4. TRUE 
#>  5    3.    4.    5. FALSE
#>  6   NA    NA     6. TRUE 
#>  7    4.    5.   NA  FALSE
#>  8   NA     6.   NA  TRUE 
#>  9    5.   NA    NA  TRUE 
#> 10   NA     7.   NA  TRUE

Alternatively, you can use a little base R instead of tidy evaluation:

stackdf %>% mutate(test = rowSums(is.na(.[two_names])) > 0)
#>    jack jill jane  test
#> 1     1    1    1 FALSE
#> 2    NA    2    2  TRUE
#> 3     2   NA    3  TRUE
#> 4    NA    3    4  TRUE
#> 5     3    4    5 FALSE
#> 6    NA   NA    6  TRUE
#> 7     4    5   NA FALSE
#> 8    NA    6   NA  TRUE
#> 9     5   NA   NA  TRUE
#> 10   NA    7   NA  TRUE

...which will likely be much faster, because iterating with rowwise makes n calls (one per row) instead of a single vectorized call.
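
The same vectorized idea can also be written with dplyr's own column-wise helpers. The following is a minimal sketch, not part of the original answer; it assumes dplyr 1.0.4 or later (for if_any() and c_across()), and the names vectorized and by_row are only illustrative.

```r
library(dplyr)

stackdf <- data.frame(jack = c(1, NA, 2, NA, 3, NA, 4, NA, 5, NA),
                      jill = c(1, 2, NA, 3, 4, NA, 5, 6, NA, 7),
                      jane = c(1, 2, 3, 4, 5, 6, NA, NA, NA, NA))
two_names <- c("jack", "jill")

# Vectorized: if_any() applies is.na to each selected column and
# combines the results with "or", giving one logical value per row.
vectorized <- stackdf %>%
  mutate(test = if_any(all_of(two_names), is.na))

# Row-wise: c_across() collects the selected columns of the current row
# into a single vector, which anyNA() can then check.
by_row <- stackdf %>%
  rowwise() %>%
  mutate(test = anyNA(c_across(all_of(two_names)))) %>%
  ungroup()
```

Both variants should reproduce the test column shown above. all_of() makes it explicit that two_names is an external character vector rather than a column of stackdf.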


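This is actually really close, but I ran into the same problem when I tried this approach. Can you see my answer? Calling list() seems to work, but it is not clear to me why as.list and list behave differently here. - Brandon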
1
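Oops, we both forgot c. !!! splices them in as separate arguments, so the second one was passed to the recursive parameter of anyNA, which is not right. - alistaire
Why is rowwise needed? Doesn't mutate already operate row by row? - JelenaČuklina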
1
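@Jelena-bioinf Not inherently; it depends on the functions used inside it. For example, + returns a vector, but sum collapses everything to a single number (as does anyNA here, so without rowwise it would return a single TRUE, which mutate would recycle down the column). There are always alternatives to rowwise (which is now discouraged), but they often require deeper knowledge of the available functions and of functional programming. - alistaire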

6

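There are a few keys to solving this question:

  • Accessing the strings within a character vector and using them with dplyr
  • The format of the arguments supplied to the function used within mutate, here anyNA

The goal here is to replicate this call, but using the named variable two_names instead of manually typing out c(jack,jill):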

stackdf %>% rowwise %>% mutate(test = anyNA(c(jack,jill)))

# A tibble: 10 x 4
    jack  jill  jane  test
   <dbl> <dbl> <dbl> <lgl>
 1     1     1     1 FALSE
 2    NA     2     2  TRUE
 3     2    NA     3  TRUE
 4    NA     3     4  TRUE
 5     3     4     5 FALSE
 6    NA    NA     6  TRUE
 7     4     5    NA FALSE
 8    NA     6    NA  TRUE
 9     5    NA    NA  TRUE
10    NA     7    NA  TRUE

1. Using dynamic variables with dplyr

  1. Using quo/quos: Does not accept strings as input. The solution using this method would be:

    two_names2 <- quos(c(jack, jill))
    stackdf %>% rowwise %>% mutate(test = anyNA(!!! two_names2))
    

    Note that quo takes a single argument, and thus is unquoted using !!, and for multiple arguments you can use quos and !!! respectively. This is not desirable because I do not use two_names and instead have to type out the columns I wish to use.

  2. Using as.name or rlang::sym/rlang::syms: as.name and sym take only a single input, however syms will take multiple and return a list of symbolic objects as output.

    > two_names
    [1] "jack" "jill"
    > as.name(two_names)
    jack
    > syms(two_names)
    [[1]]
    jack
    
    [[2]]
    jill
    

    Note that as.name ignores everything after the first element. However, syms appears to work appropriately here, so now we need to use this within the mutate call.

2. Using dynamic variables within mutate, with anyNA or other functions

  1. Using syms and anyNA directly does not actually produce the correct result.

    > stackdf %>% rowwise %>% mutate(test = anyNA(!!! syms(two_names)))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3 FALSE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA FALSE
    10    NA     7    NA  TRUE
    

    Inspection of the test column shows that only the first element (jack) is taken into account and the second (jill) is ignored. However, if I use a different function, e.g. sum or paste0, it is clear that both elements are being used:

    > stackdf %>% rowwise %>% mutate(test = sum(!!! syms(two_names), 
                                                na.rm = TRUE))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <dbl>
     1     1     1     1     2
     2    NA     2     2     2
     3     2    NA     3     2
     4    NA     3     4     3
     5     3     4     5     7
     6    NA    NA     6     0
     7     4     5    NA     9
     8    NA     6    NA     6
     9     5    NA    NA     5
    10    NA     7    NA     7
    

    The reason for this becomes clear when you look at the arguments for anyNA vs sum.

    function (x, recursive = FALSE) .Primitive("anyNA")

    function (..., na.rm = FALSE) .Primitive("sum")

    anyNA expects a single object x, so the spliced jill is matched to recursive rather than being checked, whereas sum accepts any number of objects through (...).

  2. Simply supplying c() fixes this problem (see answer from alistaire).

    > stackdf %>% rowwise %>% mutate(test = anyNA(c(!!! syms(two_names))))
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3  TRUE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA  TRUE
    10    NA     7    NA  TRUE
    
  3. Alternately... for educational purposes, one could use a combination of sapply, any, and anyNA to produce the correct result. Here we use list so that the results are provided as a single list object.

    # this produces an error because the elements spliced by !!!
    # are passed to separate arguments of sapply (X =, FUN = )
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(!!! syms(two_names), anyNA)))
    Error in mutate_impl(.data, dots) : 
      Evaluation error: object 'jill' of mode 'function' was not found.
    

    Wrapping the spliced symbols in list fixes this problem because it binds all of them into a single object. as.list does not, because it uses only its first argument.

    # the below table is the familiar incorrect result that uses only the `jack`
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(X=as.list(!!! syms(two_names)), 
                                 FUN=anyNA)))
    
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3 FALSE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA FALSE
    10    NA     7    NA  TRUE
    
    # this produces the correct answer
    > stackdf %>% rowwise %>% 
        mutate(test = any(sapply(X = list(!!! syms(two_names)), 
                                 FUN = anyNA)))
    
        jack  jill  jane  test
       <dbl> <dbl> <dbl> <lgl>
     1     1     1     1 FALSE
     2    NA     2     2  TRUE
     3     2    NA     3  TRUE
     4    NA     3     4  TRUE
     5     3     4     5 FALSE
     6    NA    NA     6  TRUE
     7     4     5    NA FALSE
     8    NA     6    NA  TRUE
     9     5    NA    NA  TRUE
    10    NA     7    NA  TRUE
    

    Why these two behave differently becomes clearer when their output is compared:

    > as.list(two_names)
    [[1]]
    [1] "jack"
    
    [[2]]
    [1] "jill"
    
    > list(two_names)
    [[1]]
    [1] "jack" "jill"
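
One way to see what each of the calls above actually becomes is to build the expression with rlang::expr(), which performs the !!! splicing without evaluating anything. This is a small sketch that assumes rlang is installed; the spliced_* names are only illustrative.

```r
library(rlang)

two_names <- c("jack", "jill")

# expr() splices the symbols but does not evaluate the call,
# so printing the result shows the code mutate() would run.
spliced_bare <- expr(anyNA(!!!syms(two_names)))
spliced_c    <- expr(anyNA(c(!!!syms(two_names))))
spliced_as   <- expr(as.list(!!!syms(two_names)))
spliced_list <- expr(list(!!!syms(two_names)))

print(spliced_bare)  # anyNA(jack, jill): jill is matched to `recursive`
print(spliced_c)     # anyNA(c(jack, jill))
print(spliced_as)    # as.list(jack, jill): only jack is matched to `x`
print(spliced_list)  # list(jack, jill)
```

Seen this way, the as.list(two_names) versus list(two_names) comparison is an analogy: inside mutate, as.list() receives jack and jill as two separate arguments and uses only the first, while list() keeps both.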
    
