Problem creating an R function that uses dplyr::filter

Question

Tags: r, filter, dplyr, rlang, tidyeval

7

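I looked at other answers but could not find a solution that makes the code below work. Basically, I am writing a function that performs an inner_join on two data frames and then filters the result based on a column passed to the function.

The problem is that the filter part of the function does not work. However, it works if I take the filter out of the function and append it afterwards, as in mydiff("a") %>% filter(ax != ay).

Any suggestion is helpful.

Note that I pass the function input in quotes.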

library(dplyr)

# fake data
df1<- tibble(id = seq(4,19,2), 
             a = c("a","b","c","d","e","f","g","h"), 
             b = c(rep("foo",3), rep("bar",5)))
df2<- tibble(id = seq(10, 20, 1), 
             a = c("d","a", "e","f","k","m","g","i","h", "a", "b"),
             b = c(rep("bar", 7), rep("foo",4)))

# What I am trying to do
dplyr::inner_join(df1, df2, by = "id") %>% select(id, b.x, b.y) %>% filter(b.x!=b.y)

#> # A tibble: 1 x 3
#>      id b.x   b.y  
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

# creating a function so that I can filter by difference in column if I have more columns
mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 = paste0(quo_name(filteron), "x")
  col_2 = paste0(quo_name(filteron), "y")
  my_df<- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter(col_1 != col_2)
}

# the filter part is not working as expected. 
# There is no difference whether i pipe filter or leave it out
mydiff("a")

#> # A tibble: 5 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    10 d     d    
#> 2    12 e     e    
#> 3    14 f     k    
#> 4    16 g     g    
#> 5    18 h     h

- x85ms16

5 Answers

5

From https://dplyr.tidyverse.org/articles/programming.html:

Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don't follow the usual R rules of evaluation.
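To make the quoted statement concrete, here is a minimal sketch of what goes wrong in the question's function. The two-row tibble and the names masked and unmasked are made up for illustration; only filter() and tibble() come from dplyr.

```r
library(dplyr)

# Illustrative data, not the question's data
df <- tibble(ax = c("d", "f"), ay = c("d", "k"))

# Data masking: `ax` and `ay` are looked up as columns of `df`
masked <- filter(df, ax != ay)

# `col_1` and `col_2` are not columns, so they are found in the calling
# environment as ordinary strings. The condition is "ax" != "ay", which is
# a single TRUE, so every row is kept.
col_1 <- "ax"
col_2 <- "ay"
unmasked <- filter(df, col_1 != col_2)

nrow(masked)    # 1
nrow(unmasked)  # 2
```

This is why filter(col_1 != col_2) inside the question's function appears to do nothing.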

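This can sometimes create a few issues when you try to wrap these functions in your own function. Here is a base R version of the function you created.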

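mydiff <- function(filteron, df_1 = df1, df_2 = df2) {
  col_1 <- paste0(filteron, "x")
  col_2 <- paste0(filteron, "y")

  # Use the function arguments rather than the global df1/df2
  my_df <- merge(df_1, df_2, by = "id", suffixes = c("x", "y"))

  # Keep the rows where the two columns differ
  my_df[my_df[, col_1] != my_df[, col_2], c("id", col_1, col_2)]
}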

> mydiff("a")
  id ax ay
3 14  f  k
> mydiff("b")
  id  bx  by
5 18 bar foo

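This will solve your problem and will most likely keep working as expected, now and in the future. With fewer dependencies on outside packages, you reduce the chance of this kind of issue and of other quirks that may appear over time as package authors evolve their work.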

- Justin

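I think the advice to abandon dplyr needs to be balanced against the downsides of doing so, mainly losing the portability of your code to different data sources. - Lionel Henry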

3

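Interesting point. But perhaps dropping dplyr widens the portability of the code, since functions written without it are simpler, more predictable, and more consistent. Because functions are the building blocks of packages, and packages remain the gold standard for shipping R code to others, base code is more portable than dplyr and reaches a wider range of data sources. - Justin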

1

@lionel How does not using dplyr affect the portability of the code? - meh

I am talking about dplyr backends. - Lionel Henry

@Justin This is a nice alternative that solves my problem. Thank you. - x85ms16

1

This looks like an evaluation problem. Try modifying the mydiff function using the lazyeval package:

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  criteria <- lazyeval::interp(~ x != y, .values = list(x = as.name(col_1), y = as.name(col_2)))
  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  # Note: filter_() is deprecated as of dplyr 0.7.0 in favour of tidy evaluation
  my_df %>% select(id, col_1, col_2) %>% filter_(criteria)
}

You can check out the Functions chapter of Hadley Wickham's book Advanced R to learn more about this.

- Augusto Fadel

1

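The advice to write simple functions in base R is good, but it does not scale to more complex tidyverse functions, and you lose portability to dplyr backends such as databases. If you want to create functions around tidyverse pipelines, you will have to learn a bit about R expressions and the unquoting operator !!. I suggest skimming the first sections of https://tidyeval.tidyverse.org to get a rough idea of the concepts used here.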

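Since the function you want to create takes a bare column name and does not involve complex expressions (like the ones you would pass to mutate() or summarise()), we don't need fancy tools such as quosures. We can work with symbols. To create a symbol, use as.name() or rlang::sym().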

as.name("mycolumn")
#> mycolumn

rlang::sym("mycolumn")
#> mycolumn

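The latter has the advantage of being part of a larger family of functions: ensym(), and the plural variants syms() and ensyms(). We'll use ensym() to capture a column name, i.e. delay the execution of the column in order to pass it to dplyr after a few transformations. Delaying the execution is what we call "quoting".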
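As a small sketch of the quoting step just described, the hypothetical helper col_name() below (it is not part of rlang or of the original answer) captures its argument with ensym() and converts it to a string. It accepts either a bare name or a string.

```r
library(rlang)

# Hypothetical helper, for illustration only
col_name <- function(var) {
  var <- ensym(var)   # quote: accepts a bare name or a string
  as_string(var)      # symbol -> string, ready for paste0() etc.
}

col_name(a)     # "a"
col_name("a")   # "a"

# sym() goes the other way, from a string back to a symbol
sym(paste0(col_name(a), ".x"))
#> a.x
```

The round trip from symbol to string and back is the same one the function below performs to build the suffixed column names.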

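I made a few changes to the interface of your function:

Take the data frames first, for consistency with dplyr functions.
Don't provide defaults for the data frames. Such defaults make too many assumptions.
Make by and suffix user-configurable, with reasonable defaults.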

Here is the code, with explanations inline:

mydiff <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(suffix), length(suffix) == 2)

  # Let's start by the easy task, joining the data frames
  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Now onto dealing with the diff variable. `ensym()` takes a column
  # name and delays its execution:
  var <- rlang::ensym(var)

  # A delayed column name is not a string, it's a symbol. So we need
  # to transform it to a string in order to work with paste() etc.
  # `quo_name()` works in this case but is generally only for
  # providing default names.
  #
  # Better use base::as.character() or rlang::as_string() (the latter
  # works a bit better on Windows with foreign UTF-8 characters):
  var_string <- rlang::as_string(var)

  # Now let's add the suffix to the name:
  col1_string <- paste0(var_string, suffix[[1]])
  col2_string <- paste0(var_string, suffix[[2]])

  # dplyr::select() supports column names as strings but it is an
  # exception in the dplyr API. Generally, dplyr functions take bare
  # column names, i.e. symbols. So let's transform the strings back to
  # symbols:
  col1 <- rlang::sym(col1_string)
  col2 <- rlang::sym(col2_string)

  # The delayed column names now need to be inserted back into the
  # dplyr code. This is accomplished by unquoting with the !!
  # operator:
  df %>%
    dplyr::select(id, !!col1, !!col2) %>%
    dplyr::filter(!!col1 != !!col2)
}

mydiff(df1, df2, b)
#> # A tibble: 1 x 3
#>      id b.x   b.y
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

mydiff(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k

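You can also simplify the function by taking strings instead of bare column names. In this version I'll use syms() to create a list of symbols and !!! to pass them all at once to select():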

mydiff2 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Create a list of symbols from a character vector:
  cols <- rlang::syms(paste0(var, suffix))

  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Unquote the whole list at once with the big bang !!!
  df %>%
    dplyr::select(id, !!!cols) %>%
    dplyr::filter(!!cols[[1]] != !!cols[[2]])
}

mydiff2(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k

- Lionel Henry

Great answer! Not sure why it was downvoted. Could we use quo_name(enquo(var)) to make mydiff2 more flexible, i.e. accept either a string or a symbol as input? - Tung

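It wouldn't work in this case, because the inputs have to follow select syntax since they are forwarded to select(). At the same time they can't use select helpers, because they are also forwarded to filter(). So the common denominator is bare column names only. That's a good point, I'll add it to the book. - Lionel Henry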

In that case I prefer mydiff() over mydiff2(), as it is more flexible :) - Tung

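@lionel, this is very helpful. Thank you for explaining things in detail. Now I can refer to this when writing more complex dplyr functions. Your tips about arguments and quoting are very helpful. One question: do you need to unquote with !! or !!! every time you use a column that comes from the arguments? - x85ms16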

Yes, the main pattern for creating functions around dplyr pipelines is "quote and unquote". You quote with enquo() or ensym() (or their plural variants) and unquote with !! and !!!. - Lionel Henry
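A minimal sketch of the quote-and-unquote pattern mentioned in this comment, using enquo() rather than ensym(). The wrapper keep_above() and the scores data are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical wrapper, for illustration only
keep_above <- function(df, var, threshold) {
  var <- enquo(var)                  # quote the caller's column expression
  df %>% filter(!!var > threshold)   # unquote it inside the pipeline
}

scores <- tibble(id = 1:4, score = c(10, 25, 5, 40))
keep_above(scores, score, 20)
```

With this data the call should keep the rows with id 2 and 4. Every use of the captured column inside the pipeline needs its own !!.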

1

Finding the indices where col_1 != col_2 first might be enough to solve this.

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  my_df <-
    inner_join(df_1, df_2, by = "id", suffix = c("x", "y")) %>%
    select(id, col_1, col_2)
  # find the rows where the two columns differ
  # ([[ ]] extracts a vector; [, col] on a tibble would return a tibble)
  differs <- my_df[[col_1]] != my_df[[col_2]]
  # return those rows
  my_df[differs, ]
}
mydiff("a")
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

- younggeun

- Tung · Accepted Answer

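Your original function did not work because col_1 is a string, whereas dplyr::filter() expects an "unquoted" input variable on the LHS. You therefore need to convert col_1 to a symbol with sym() first, then unquote it inside filter() using !! (bang bang). rlang has a really nice function, qq_show, that shows what actually happens with quoting and unquoting (see the output below).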

See also this similar question.

library(rlang)
library(dplyr)

# creating a function that can take either string or symbol as input
mydiff <- function(filteron, df_1 = df1, df_2 = df2) {

  col_1 <- paste0(quo_name(enquo(filteron)), "x")
  col_2 <- paste0(quo_name(enquo(filteron)), "y")

  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))

  cat('\nwithout sym and unquote\n')
  qq_show(col_1 != col_2)

  cat('\nwith sym and unquote\n')
  qq_show(!!sym(col_1) != !!sym(col_2))
  cat('\n')

  my_df %>% 
    select(id, col_1, col_2) %>% 
    filter(!!sym(col_1) != !!sym(col_2))
}

### testing: filteron as a string
mydiff("a")
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#> 
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

### testing: filteron as a symbol
mydiff(a)
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#>  
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

Created on 2018-09-28 by the reprex package (v0.2.1.9000).
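As an alternative to sym() plus !!, the .data pronoun exported by dplyr can look up a column from a string. The sketch below is not from any of the answers above: the name mydiff_data is hypothetical, and it assumes dplyr 1.0.0 or later for all_of().

```r
library(dplyr)

# Same fake data as in the question
df1 <- tibble(id = seq(4, 19, 2),
              a = c("a", "b", "c", "d", "e", "f", "g", "h"),
              b = c(rep("foo", 3), rep("bar", 5)))
df2 <- tibble(id = seq(10, 20, 1),
              a = c("d", "a", "e", "f", "k", "m", "g", "i", "h", "a", "b"),
              b = c(rep("bar", 7), rep("foo", 4)))

# Hypothetical string-based variant, for illustration only
mydiff_data <- function(df_1, df_2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(var), length(var) == 1)
  cols <- paste0(var, suffix)   # e.g. "a.x" and "a.y"

  inner_join(df_1, df_2, by = by, suffix = suffix) %>%
    select(all_of(c(by, cols))) %>%               # select by character vector
    filter(.data[[cols[1]]] != .data[[cols[2]]])  # look up columns by name
}

mydiff_data(df1, df2, "a")
```

For "a" this should return the single row with id 14, matching the output above. Unlike mydiff(), this variant only accepts strings.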