Function with an argument passed to dplyr::filter: what is the best way to work around NSE?

Question

9

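Non-standard evaluation (NSE) is really handy when using dplyr verbs, but it can be problematic when those verbs are used with function arguments. For example, suppose I want to create a function that returns the number of rows for a given species.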

# Load packages and prepare data
library(dplyr)
library(lazyeval)
# I prefer lowercase column names
names(iris) <- tolower(names(iris))
# Number of rows for all species
nrow(iris)
# [1] 150

Example that does not work

This function does not work as expected because species is interpreted in the context of the iris data frame rather than as the function argument:

nrowspecies0 <- function(dtf, species){
    dtf %>%
        filter(species == species) %>%
        nrow()
}
nrowspecies0(iris, species = "versicolor")
# [1] 150

Three example implementations

To work around non-standard evaluation, I usually append an underscore to the argument name:

nrowspecies1 <- function(dtf, species_){
    dtf %>%
        filter(species == species_) %>%
        nrow()
}

nrowspecies1(iris, species_ = "versicolor")
# [1] 50
# Because of R's partial argument matching, the shorter
# argument name species works too
nrowspecies1(iris, species = "versicolor")
# [1] 50

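This approach is not entirely satisfactory: it either changes the argument name to something less user-friendly, or it relies on partial argument matching, which is probably not good practice in programming. To keep a nice argument name, I could do this: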

nrowspecies2 <- function(dtf, species){
    species_ <- species
    dtf %>%
        filter(species == species_) %>%
        nrow()
}
nrowspecies2(iris, species = "versicolor")
# [1] 50

Another way to work around non-standard evaluation, based on this answer. interp() interprets species in the context of the function environment:

nrowspecies3 <- function(dtf, species){
    dtf %>%
        filter_(interp(~species == with_species, 
                       with_species = species)) %>%
        nrow()
}
nrowspecies3(iris, species = "versicolor")
# [1] 50

Considering the three functions above, what is the best and most robust way to implement this filter function? Are there other ways?

- Paul Rougieux

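The quoting of data frame column names is one of the reasons I started to prefer Python. See Tidyverse-style pandas: "The Tidyverse allows a mix of quoted and unquoted references to variable names. In my (in)experience, the convenience this brings is accompanied by equal consternation. It seems to me that many of the problems solved by tidyeval would not exist if all variables were quoted all the time, as in pandas, but there are likely deeper truths I am missing here..." - Paul Rougieux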

3 Answers

5

This problem has absolutely nothing to do with non-standard evaluation. Let me rewrite your initial function to make that clear:

nrowspecies4 <- function(dtf, boo){
    dtf %>%
        filter(boo == boo) %>%
        nrow()
}
nrowspecies4(iris, boo = "versicolor")
#150

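The expression inside your filter always evaluates to TRUE (almost always; see the NA comparison below), and that is why it does not work, not because of some NSE magic.

Your nrowspecies2 is the way to go.

As an aside, species in your nrowspecies0 is indeed evaluated as the column, not as the input argument species. You can check this by comparing nrowspecies0(iris, NA) with nrowspecies4(iris, NA).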
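A minimal sketch of that check, assuming dplyr is installed. It works on a lowercased copy of iris; the name iris_lc is introduced here purely for illustration. In nrowspecies0 the column is compared with itself, so passing NA changes nothing. In nrowspecies4 the argument is compared with itself, so NA == NA yields NA and filter() drops every row.

```r
library(dplyr)

# Lowercased copy of iris; iris_lc is a name made up for this sketch
iris_lc <- iris
names(iris_lc) <- tolower(names(iris_lc))

# Column compared with itself: the argument is never used
nrowspecies0 <- function(dtf, species) {
    dtf %>%
        filter(species == species) %>%
        nrow()
}

# Argument compared with itself: no data frame column is involved
nrowspecies4 <- function(dtf, boo) {
    dtf %>%
        filter(boo == boo) %>%
        nrow()
}

n_column   <- nrowspecies0(iris_lc, NA)  # every row is kept
n_argument <- nrowspecies4(iris_lc, NA)  # NA == NA is NA, so no row is kept
print(c(n_column, n_argument))
# [1] 150   0
```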

- eddi

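Not sure why, but this did not work for me. I ended up using filter_ as suggested in the answer below. (Side note: my function also uses group_by and pipes the result further, so that may be the reason.) - jjj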

1

In his 2016 UseR talk (@38min30s), Hadley Wickham explains the concept of referential transparency. Using a formula, the filter function can be reformulated as:

nrowspecies5 <- function(dtf, formula){
    dtf %>%
        filter_(formula) %>%
        nrow()
}

This has the added benefit of being more generic.

# Make column names lower case
names(iris) = tolower(names(iris)) 
nrowspecies5(iris, ~ species == "versicolor")
# 50
nrowspecies5(iris, ~ sepal.length > 6 & species == "virginica")
# 41
nrowspecies5(iris, ~ sepal.length > 6 & species == "setosa")
# 0

- Paul Rougieux

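This throws Error: object 'species' not found. - daaronr

That is because I like to convert all column names to lowercase. I have updated the answer to use names(iris) = tolower(names(iris)). However, filter_() is now deprecated, so I should revise the answer more thoroughly. - Paul Rougieux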
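Since filter_() is deprecated, one possible replacement is to forward unquoted conditions through `...` instead of passing a formula. This is only a sketch, assuming a recent dplyr; the names nrowspecies6 and iris_lc are hypothetical and not part of the original answer.

```r
library(dplyr)

# Lowercased copy of iris; iris_lc is a name made up for this sketch
iris_lc <- iris
names(iris_lc) <- tolower(names(iris_lc))

# Forward any number of filter conditions to dplyr::filter() via the dots
nrowspecies6 <- function(dtf, ...) {
    dtf %>%
        filter(...) %>%
        nrow()
}

nrowspecies6(iris_lc, species == "versicolor")
# [1] 50
nrowspecies6(iris_lc, sepal.length > 6 & species == "virginica")
# [1] 41
nrowspecies6(iris_lc, sepal.length > 6 & species == "setosa")
# [1] 0
```

The conditions are written exactly as they would be inside filter(), so the wrapper stays as generic as the formula version without relying on the deprecated verb.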

- jaimedash · Accepted Answer

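@eddi's answer is correct about what is going on here. I am writing another answer that addresses the larger question of how to write functions using dplyr verbs. You will note that, in the end, it uses something like nrowspecies2 to avoid the species == species tautology.

To write a function wrapping dplyr verb(s) that works with NSE, write two functions. First, write a version that requires quoted inputs, using lazyeval and the SE version of the dplyr verb. In this case that is filter_.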

nrowspecies_robust_ <- function(data, species){ 
  species_ <- lazyeval::as.lazy(species) 
  condition <- ~ species == species_ # *
  tmp <- dplyr::filter_(data, condition) # **
  nrow(tmp)
} 
nrowspecies_robust_(iris, ~versicolor)

Second, make a version that uses NSE:

nrowspecies_robust <- function(data, species) { 
  species <- lazyeval::lazy(species) 
  nrowspecies_robust_(data, species) 
} 
nrowspecies_robust(iris, versicolor)

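* = if you want to do something more complex, you may need to use lazyeval::interp here, as in the tips linked below

** = also, if you need to change output names, see the .dots argument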
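With current dplyr, where the underscored verbs such as filter_() are deprecated, the same two-function pattern might be sketched with the .data and .env pronouns from rlang. This assumes dplyr 1.0 or later and rlang 0.4 or later; the names nrowspecies_chr, nrowspecies_nse and iris_lc are hypothetical and do not come from the answer above.

```r
library(dplyr)

# Lowercased copy of iris; iris_lc is a name made up for this sketch
iris_lc <- iris
names(iris_lc) <- tolower(names(iris_lc))

# Standard-evaluation version: species is an ordinary character string.
# .data$species is the column, .env$species is the function argument,
# so the two no longer shadow each other.
nrowspecies_chr <- function(data, species) {
    data %>%
        filter(.data$species == .env$species) %>%
        nrow()
}

# NSE version: capture the bare name typed by the caller, convert it
# to a string, then delegate to the standard-evaluation version.
nrowspecies_nse <- function(data, species) {
    species_chr <- rlang::as_name(rlang::ensym(species))
    nrowspecies_chr(data, species_chr)
}

nrowspecies_chr(iris_lc, "versicolor")
# [1] 50
nrowspecies_nse(iris_lc, versicolor)
# [1] 50
```

The pronouns make the intent explicit, which also lets the argument keep the user-friendly name species without a renamed local copy.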

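For the above, I followed some tips from Hadley (link).
Another good resource is the dplyr vignette on NSE (link), which covers .dots, interp, and other functions from the lazyeval package.
For more details on lazyeval, see its vignette.
For a thorough discussion of the base R tools for working with NSE (many of which lazyeval saves you from having to think about), see the chapter on NSE in Advanced R (link).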