grep using a character vector with multiple patterns

Question

180

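I am trying to use grep to test whether a vector of strings is present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this: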

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

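I have a vector of strings, patterns, to be found in the "Letter" column, for instance: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector is present in the "Letter" column. If any are, I would like the output to be the unique values.

The problem is, I don't know how to use grep with multiple patterns. I tried: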

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is not correct. Any suggestions?

- user971102

3

You can't use fixed=TRUE because your pattern is a true regular expression. - Marek

6

Using match, %in% or even == is the only correct way to compare exact matches. Regex is very dangerous for such a task and can lead to unexpected results. - David Arenburg
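
To illustrate the risk this comment describes, here is a minimal sketch comparing exact matching with regex matching. The extra value "A10" is a hypothetical addition that is not in the question's data; it is there only to show a partial match.

```r
# Hypothetical data: the Letter column from the question plus an extra "A10"
letter <- c("A1", "A6", "A7", "A1", "A9", "A6", "A10")
patterns <- c("A1", "A9", "A6")

# Exact matching: keeps only values that are equal to one of the patterns
exact <- unique(letter[letter %in% patterns])

# Regex matching: "A1" is also found inside "A10"
regex <- unique(grep(paste(patterns, collapse = "|"), letter, value = TRUE))

print(exact)
print(regex)
```

If the values in patterns are meant to be whole cell values rather than substrings, %in% is probably the safer choice.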

11 Answers

46

Good answers, but don't forget about filter() from dplyr:

library(dplyr)

patterns <- c("A1", "A9", "A6")
>your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

>result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6

- Adamm

3

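I think grepl works with one pattern at a time (it needs a vector of length 1), but we have 3 patterns (a vector of length 3), so we can combine them into one using the grepl-friendly | separator. Try your luck with others :) - Adamm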

3

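I see, it's a compressed way of producing something like A1 | A2, so if one wanted all conditions then the collapse would be with the & sign. Cool, thanks. - Ahdee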
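
A note on the "all conditions" case: & has no special meaning in a regular expression, so collapsing with "&" searches for a literal ampersand. One possible way to require every pattern is to run grepl once per pattern and combine the results. This is a minimal sketch; the vector x is hypothetical data made up for illustration.

```r
# Hypothetical data for illustration
x <- c("A1,A9", "A1", "A9,A6", "A1&A9")
patterns <- c("A1", "A9")

# Collapsing with "&" looks for the literal text "A1&A9"
literal_amp <- grepl(paste(patterns, collapse = "&"), x)

# One grepl() call per pattern, then combine the logical vectors with &
all_present <- Reduce(`&`, lapply(patterns, grepl, x = x))

print(literal_amp)
print(all_present)
```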

1

Hi, using )|( to separate the patterns might make it more robust: paste0("(", paste(patterns, collapse=")|("),")"). Unfortunately it also becomes slightly less elegant. This results in the pattern (A1)|(A9)|(A6). - fabern
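
One situation where grouping matters is when anchors are added around the alternation. The sketch below is an illustration under that assumption (the comment itself does not mention anchors), and the vector x is hypothetical.

```r
patterns <- c("A1", "A9", "A6")
x <- c("A1", "A10", "XA9", "A6")  # hypothetical values

# The form suggested in the comment: each pattern in its own group
per_pattern <- paste0("(", paste(patterns, collapse = ")|("), ")")

# With a group, the anchors apply to the whole alternation
grouped <- paste0("^(", paste(patterns, collapse = "|"), ")$")

# Without a group, "^" binds only to A1 and "$" only to A6
ungrouped <- paste0("^", paste(patterns, collapse = "|"), "$")

print(per_pattern)
print(grepl(grouped, x))
print(grepl(ungrouped, x))
```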

41

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'

- BOC

13

%like% is not in base R, so you should mention which package(s) are needed to use it. - Gregor Thomas

2

For others looking at this answer, %like% is part of the data.table package. Also similar in data.table are like(...), %ilike%, and %flike%. - steveb

10

Based on Brian Diggs's post, here are two useful functions for filtering lists:

#Returns all items in a list that are not contained in toMatch
#toMatch can be a single item or a list of items
exclude <- function (theList, toMatch){
  return(setdiff(theList,include(theList,toMatch)))
}

#Returns all items in a list that ARE contained in toMatch
#toMatch can be a single item or a list of items
include <- function (theList, toMatch){
  matches <- unique (grep(paste(toMatch,collapse="|"), 
                          theList, value=TRUE))
  return(matches)
}

- Austin
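
As a usage sketch for the two helpers above (redefined here so that the block runs on its own; the letter vector is assumed from the question's data):

```r
# Same logic as the include()/exclude() helpers in the answer above
include <- function(theList, toMatch) {
  unique(grep(paste(toMatch, collapse = "|"), theList, value = TRUE))
}
exclude <- function(theList, toMatch) {
  setdiff(theList, include(theList, toMatch))
}

# Letter column from the question
letter <- c("A1", "A6", "A7", "A1", "A9", "A6")

kept <- include(letter, c("A1", "A9"))
dropped <- exclude(letter, c("A1", "A9"))

print(kept)
print(dropped)
```

Because both helpers build a regular expression, they match substrings, so the caveat about partial matches raised in the comments on the question applies here as well.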

6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)

- user3877096

4

One thing to note is that match doesn't use pattern matching; it expects an exact match. - steveb
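
To make the behaviour of match() concrete, here is a small sketch. The extra pattern "B2" is hypothetical, added only to show what happens when there is no match.

```r
# Letter column from the question
letter <- c("A1", "A6", "A7", "A1", "A9", "A6")
patterns <- c("A1", "A9", "A6", "B2")  # "B2" is a made-up non-matching value

# match() returns, for each pattern, the position of its FIRST exact match
pos <- match(patterns, letter)

# Patterns that occur at least once in the column
found <- patterns[!is.na(pos)]

# To get every matching row rather than only the first, use %in%
rows <- which(letter %in% patterns)

print(pos)
print(found)
print(rows)
```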

5

Adding to Brian Diggs's answer, another way using grepl returns a data frame containing all of the matching rows.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?

- DryLabRebel

4

Not sure whether this answer has already appeared...

For the particular patterns in the question, you can do it with a single grep() call.

grep("A[169]", myfile$Letter)

- Assaf

2

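Remove the spaces (and, as noted in the comments on the question, drop fixed=TRUE, since | only acts as alternation in a regular expression). So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))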

- user9325029

2

Use sapply.

patterns <- c("A1", "A9", "A6")
df <- data.frame(name=c("A","Ale","Al","lex","x"),Letters=c("A1","A2","A9","A1","A9"))



   name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9


 df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = F)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9

- dondapati
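
In the output above the rows come back grouped by pattern (1, 4, 3, 5) rather than in their original order. If the original order is wanted, one option is to sort the unique indices first. This is a sketch using the same data, with lapply swapped in for sapply:

```r
patterns <- c("A1", "A9", "A6")
df <- data.frame(
  name = c("A", "Ale", "Al", "lex", "x"),
  Letters = c("A1", "A2", "A9", "A1", "A9"),
  stringsAsFactors = FALSE
)

# One grep() per pattern; unlist() flattens the per-pattern row indices
idx <- unlist(lapply(patterns, grep, x = df$Letters))

# Deduplicate and restore the original row order before subsetting
ordered <- df[sort(unique(idx)), ]

print(idx)
print(ordered)
```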

0

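Another option is to use syntax like '\\b(A1|A9|A6)\\b' as the pattern. This adds regular-expression word boundaries, so if, for example, Bob had the letters "A7,A1", you could still extract the row with this syntax. Here is a reproducible example for both options: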

df <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

df2 <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

Created on 2022-07-16 by the reprex package (v2.0.1)

Please note: if you are using R v4.1+, you can use \\b, otherwise use \b.

- Quinten
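
A side note on the backslashes: in an ordinary R string literal the word boundary is written "\\b". Raw string literals, available from R 4.0.0, allow a single backslash instead. The sketch below assumes R 4.0.0 or later, and the vector x is hypothetical.

```r
x <- c("A1", "A7,A1", "A10", "XA6")  # hypothetical values

# Ordinary string literal: the backslash itself must be escaped
escaped <- "\\b(A1|A9|A6)\\b"

# Raw string literal (R >= 4.0.0): backslashes are taken as written
raw <- r"(\b(A1|A9|A6)\b)"

same_pattern <- identical(escaped, raw)
hits <- grepl(escaped, x)

print(same_pattern)
print(hits)
```

With the word boundaries in place, "A10" and "XA6" are not matched, while "A7,A1" still is.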


- Brian Diggs · Accepted Answer

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".
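
A small sketch of why the original call returned nothing, using the Letter values from the question: with fixed=TRUE the whole string, including | and the spaces, is searched for literally, and even as a regex the spaces become part of the alternatives.

```r
# Letter column from the question
letter <- c("A1", "A6", "A7", "A1", "A9", "A6")

# fixed = TRUE: searches for the literal text "A1| A9 | A6"
n_fixed <- length(grep("A1| A9 | A6", letter, value = TRUE, fixed = TRUE))

# Regex with spaces: the alternatives are "A1", " A9 " and " A6"
with_spaces <- unique(grep("A1| A9 | A6", letter, value = TRUE))

# Regex without spaces: the intended alternation
clean <- unique(grep("A1|A9|A6", letter, value = TRUE))

print(n_fixed)
print(with_spaces)
print(clean)
```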

You also mention that there are lots of patterns. Assume that they are in a vector:

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))
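
Putting the pieces together, here is an end-to-end sketch. The myfile data frame is rebuilt from the table in the question, and the names pattern and matched_rows are introduced here only for illustration.

```r
# Data frame reconstructed from the table in the question
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)

toMatch <- c("A1", "A9", "A6")

# Build a single alternation regex from the vector of patterns
pattern <- paste(toMatch, collapse = "|")

# Unique matching values, as asked for in the question
matches <- unique(grep(pattern, myfile$Letter, value = TRUE))

# The full matching rows, if those are needed instead
matched_rows <- myfile[grepl(pattern, myfile$Letter), ]

print(pattern)
print(matches)
print(matched_rows)
```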