Best practice for programmatically filtering on multiple logical columns with dplyr

Question

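The problem

I need two functions that apply an AND filter and an OR filter to a data frame, based on indicator columns (i.e. logicals) that may contain missing values. Each function should take a character vector naming the columns to consider.

My solution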

library(dplyr) # provides the %>% pipe used below

filter_checked <- function(db, vars = NULL) {
  db %>%
    dplyr::filter(
      dplyr::if_all(dplyr::all_of(vars), ~ !is.na(.x) & .x)
    )
}


filter_or_checked <- function(db, vars = NULL) {
  db %>%
    dplyr::filter(
      dplyr::if_any(dplyr::all_of(vars), ~ !is.na(.x) & .x)
    )
}

Example tests (passing)

test_that("filter checks", {
  foo <- tibble::tibble(
    id = 1:5,
    a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    b = c(NA, TRUE, NA, TRUE, NA)
  )


  expect_equal(filter_checked(foo)[["id"]], 1:5)
  expect_equal(filter_checked(foo, "a")[["id"]], 1:2)
  expect_equal(filter_checked(foo, "b")[["id"]], c(2, 4))
  expect_equal(filter_checked(foo, c("a", "b"))[["id"]], 2)

})



test_that("filter_or_checks", {
  foo <- tibble::tibble(
    id = 1:5,
    a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    b = c(NA, TRUE, NA, TRUE, NA)
  )


  expect_equal(filter_or_checked(foo)[["id"]], integer(0))
  expect_equal(filter_or_checked(foo, "a")[["id"]], 1:2)
  expect_equal(filter_or_checked(foo, "b")[["id"]], c(2, 4))
  expect_equal(filter_or_checked(foo, c("a", "b"))[["id"]], c(1, 2, 4))

})

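My question

My functions feel overly complicated to me, but I suspect that is down to gaps in my own knowledge. Is there a better tidyverse solution to this problem, meaning one that is simpler to read, understand, and teach?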

- Corrado

1 Answer

- Guillaume · Accepted Answer

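In case you find this approach interesting.

The idea, when you have several booleans (three or more), is to collapse them all into a single column of digits, using 0 for FALSE and 1 for TRUE, so that five booleans become one five-character string per row.

Then:

To know whether all booleans are TRUE, count the '1' characters in each cell and require as many '1's as there are columns.
To know whether at least one column is TRUE, simply search the string for '1'.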
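
The two checks above can be sketched in base R on a small made-up example. The data frame `flags` and its columns `x`, `y`, `z` are hypothetical and contain no missing values; this is only an illustration of the encoding idea, not the code from this answer.

```r
# Hypothetical example data: three logical indicators without missing values
flags <- data.frame(
  x = c(TRUE, TRUE, FALSE),
  y = c(TRUE, FALSE, FALSE),
  z = c(TRUE, TRUE, FALSE)
)

# Encode each row as a string of 0/1 digits, one digit per column
code <- do.call(paste0, unname(lapply(flags, as.integer)))

# Count the '1' digits in each string
n_true <- lengths(regmatches(code, gregexpr("1", code, fixed = TRUE)))

all_true <- n_true == ncol(flags)           # every column is TRUE
any_true <- grepl("1", code, fixed = TRUE)  # at least one column is TRUE
```

Here `code` is one string per row, and both filters reduce to simple string tests on it.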

In my own case I did not have to deal with missing values, but you can recode them as 2.
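
As a minimal base R sketch of that recoding, assuming a hypothetical indicator vector `b` with missing values: because NA becomes the digit 2, it can never be mistaken for a TRUE when the string is searched for '1'.

```r
# Hypothetical indicator containing missing values
b <- c(NA, TRUE, NA, TRUE, FALSE)

# Convert to 0/1, then recode NA as 2 so it never matches the digit '1'
b_code <- as.integer(b)
b_code[is.na(b_code)] <- 2L
b_code
```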

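In the end, this takes more data preparation but a less complicated function, since you are no longer dealing with several booleans but with a single character string.

The code could look like this: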

library(dplyr)

# Prepare data, from your data 
foo <- tibble::tibble(
  id = 1:5,
  a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  b = c(NA, TRUE, NA, TRUE, NA),
  d_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  e_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  f_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE)
) %>% 
  mutate(a_bis = a, b_bis = b) %>% # copy columns to test
  mutate_at(vars(ends_with('_bis')), as.integer) %>% # convert logicals to integers
  mutate_at(vars(ends_with('_bis')), tidyr::replace_na, replace = 2) %>% # replace NA with 2
  mutate(af_bis = paste0(a_bis, b_bis, d_bis, e_bis, f_bis))

# A tibble: 5 x 9
     id a     b     d_bis e_bis f_bis a_bis b_bis af_bis
  <int> <lgl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
1     1 TRUE  NA        1     1     1     1     2 12111 
2     2 TRUE  TRUE      1     1     1     1     1 11111 
3     3 FALSE NA        0     0     0     0     2 02000 
4     4 FALSE TRUE      0     0     0     0     1 01000 
5     5 FALSE NA        0     0     0     0     2 02000


# list rows where at least one is TRUE
foo %>% 
  filter(grepl('1', af_bis))

# list rows where all columns are TRUE
foo %>% 
  filter(stringr::str_count(af_bis, '1') == 5L)

# list where at least one column is TRUE only if all columns are not missing
foo %>% 
  filter(grepl('1', af_bis) & ! grepl('2', af_bis))
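
The question asked for functions that take a character vector of column names. One possible way to wrap the string encoding that way is sketched below in base R. The names `encode_flags`, `filter_checked_str` and `filter_or_checked_str` are hypothetical and not part of the original answer. The "all TRUE" test is written as "the code contains no character other than '1'", which should be equivalent to counting as many '1's as there are columns.

```r
# Hypothetical helper: encode the chosen logical columns as one string per row,
# using 0 = FALSE, 1 = TRUE and 2 = NA
encode_flags <- function(db, vars) {
  # No columns selected: every row gets an empty code
  if (length(vars) == 0) return(rep("", nrow(db)))
  digits <- lapply(db[vars], function(x) {
    x <- as.integer(x)
    x[is.na(x)] <- 2L
    x
  })
  do.call(paste0, unname(digits))
}

# AND filter: keep rows whose code consists only of '1' digits
filter_checked_str <- function(db, vars = NULL) {
  code <- encode_flags(db, vars)
  db[!grepl("[^1]", code), , drop = FALSE]
}

# OR filter: keep rows whose code contains at least one '1'
filter_or_checked_str <- function(db, vars = NULL) {
  code <- encode_flags(db, vars)
  db[grepl("1", code, fixed = TRUE), , drop = FALSE]
}

# Usage on data shaped like the question's test fixture
foo2 <- data.frame(
  id = 1:5,
  a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  b = c(NA, TRUE, NA, TRUE, NA)
)
filter_checked_str(foo2, c("a", "b"))
filter_or_checked_str(foo2, c("a", "b"))
```

With no columns selected, the AND variant keeps every row and the OR variant keeps none, which matches the behaviour the question's tests expect.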