Detecting multiple strings with the R function str_detect

Question

59

I want to search for several strings and put the matches into one variable, but I keep running into errors.

queries <- httpdf %>% filter(str_detect(payload, "create" || "drop" || "select"))
Error: invalid 'x' type in 'x || y'

queries <- httpdf %>% filter(str_detect(payload, "create" | "drop" | "select"))
Error: operations are possible only for numeric, logical or complex types

queries1 <- httpdf %>% filter(str_detect(payload, "create", "drop", "select"))
Error: unused arguments ("drop", "select")

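None of these work. Is there another way to do this with str_detect, or should I try something else? I would also like the matches to end up in the same column.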

- Magick.M

15

I guess you need paste(c('create', 'drop', 'select'), collapse="|"). - akrun

3 Answers

45

Here is one way to solve this:

queries1 <- httpdf %>% 
  filter(str_detect(payload, paste(c("create", "drop", "select"),collapse = '|')))
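To see why this works, it can help to look at the pattern that paste() builds and at what str_detect() returns for it. The following is a minimal sketch, assuming stringr is installed; the payload vector is made up for illustration and is not the asker's data.

```r
library(stringr)

# Hypothetical sample data, not from the original question
payload <- c("create table t", "drop it", "nothing here", NA)

# Collapse the search terms into a single regex using alternation
pattern <- paste(c("create", "drop", "select"), collapse = "|")
print(pattern)

# str_detect() is vectorised over the strings; an NA input gives NA
hits <- str_detect(payload, pattern)
print(hits)
```

Inside dplyr::filter(), rows where the condition is NA are dropped, so the NA payload would not appear in the result.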

- penguin

1

With this example I also get "creator" (from "the creator is nice") because it contains "creat". How can I match only the exact word? - RxT

A heads-up: you need to escape reserved regex characters in your strings, e.g. write "." as "\\." in an R string, and so on. - user2363777
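One way to handle the escaping is sketched below. It assumes a stringr version that provides str_escape() (1.5.0 or later, to my knowledge); the search terms and test strings are hypothetical.

```r
library(stringr)

# Hypothetical search terms that contain regex metacharacters
terms <- c("a.b", "x+y")

# Assumption: str_escape() is available (stringr >= 1.5.0).
# It backslash-escapes metacharacters so each term is matched literally.
pattern <- paste(str_escape(terms), collapse = "|")

# Without escaping, "a.b" would also match "aXb" and "x+y" would match "xxy"
escaped_hits <- str_detect(c("a.b", "aXb", "x+y", "xxy"), pattern)
print(escaped_hits)
```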

0

I would suggest using a loop for this kind of operation. In my opinion it is more flexible.

Here is an example httpdf table (which also addresses RxT's comment):

httpdf <- tibble(
  payload = c(
    "the createor is nice",
    "try to create something to select",
    "never catch a dropping knife",
    "drop it like it's hot",
    NA,
    "totaly unrelated" ),
  other_optional_columns = 1:6 )

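I use sapply to loop over the search queries and apply each string to str_detect as a separate pattern. This returns a matrix with one column per search query string and one row per subject string, which can then be collapsed into the desired logical vector.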

queries1 <-
  httpdf[ 
    sapply(
      c("create", "drop", "select"),
      str_detect,
      string = httpdf$payload ) %>%
    rowSums( na.rm = TRUE ) != 0, ]

Of course, this can be wrapped in a function and used inside a tidyverse filter:

## function
str_detect_mult <-
  function( subject, query ) {
    sapply(
      query,
      str_detect,
      string = subject ) %>%
    rowSums( na.rm = TRUE ) != 0
}
## tidy code
queries1 <- httpdf %>% filter( str_detect_mult( payload, c("create", "drop", "select") ) )

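If you want to match exact words, word boundaries are easy to handle ("\\b" matches a word boundary and is joined onto the start and end of each search string):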

str_detect_mult_exact <-
  function( subject, query ) {
    sapply(
      query,
      function(.x)
        str_detect(
          subject,
          str_c("\\b",.x,"\\b") ) ) %>%
    rowSums( na.rm = TRUE ) != 0
}

Multiple matches are also easy to handle (e.g. if you only want rows that match exactly one of the strings, i.e. XOR):

str_detect_mult_xor <-
  function( subject, query ) {
    sapply(
      query,
      str_detect,
      string = subject ) %>%
    rowSums( na.rm = TRUE ) == 1
}

This also works in base R:

## function
str_detect_mult <-
  function( subject, query ) {
    rowSums(sapply(
      query,
      grepl,
      x = subject ), na.rm = TRUE ) != 0
}
## base R subsetting
queries1 <- httpdf[ str_detect_mult( httpdf$payload, c("create", "drop", "select") ), ]

- CramMasbür

- fabilous · Accepted Answer

In my opinion, for a fairly short list of strings to look for, an even simpler approach would be:

queries <- httpdf %>% filter(str_detect(payload, "create|drop|select"))

As @penguin suggested earlier, paste(c("create", "drop", "select"), collapse = '|') does exactly this:

[...] paste(c("create", "drop", "select"), collapse = '|') [...]

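If you have a longer list of strings to detect, I would suggest storing the individual strings in a vector first and then using @penguin's approach, for example: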

strings <- c("string1", "string2", "string3", "string4", "string5", "string6")
queries <- httpdf %>% 
  filter(str_detect(payload, paste(strings, collapse = "|")))

The advantage of this is that if you want or need to reuse the vector strings later, you can do so easily.
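The stored vector can also be combined with the word-boundary idea raised in the comments, so that only whole words match. This is a rough sketch rather than part of the original answer. It assumes stringr is installed and that the search terms contain no regex metacharacters; the payload values are copied from the example table in the other answer.

```r
library(stringr)

strings <- c("create", "drop", "select")

# Wrap the alternation in a group and anchor it with word boundaries,
# giving "\\b(create|drop|select)\\b"
pattern <- paste0("\\b(", paste(strings, collapse = "|"), ")\\b")

payload <- c(
  "the createor is nice",
  "try to create something to select",
  "never catch a dropping knife",
  "drop it like it's hot",
  NA,
  "totaly unrelated"
)

# "createor" and "dropping" no longer match; NA stays NA
whole_word_hits <- str_detect(payload, pattern)
print(whole_word_hits)
```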