Filtering observations in dplyr in combination with grepl

Question

I am trying to filter some observations out of a large dataset using dplyr and grepl. I am not wedded to grepl if there is a better solution.

Take this sample df:

df1 <- data.frame(fruit=c("apple", "orange", "xapple", "xorange", 
                          "applexx", "orangexx", "banxana", "appxxle"), group=c("A", "B") )
df1


#     fruit group
#1    apple     A
#2   orange     B
#3   xapple     A
#4  xorange     B
#5  applexx     A
#6 orangexx     B
#7  banxana     A
#8  appxxle     B

I want to:

filter out cases that begin with 'x'
filter out cases that end with 'xx'

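I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not how to remove only the cases that begin or end with them. Here is how to get rid of everything with 'xx' anywhere inside (not just at the end):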

library(dplyr)

df1 %>% filter(!grepl("xx", fruit))

#    fruit group
#1   apple     A
#2  orange     B
#3  xapple     A
#4 xorange     B
#5 banxana     A

From my point of view this is obviously wrong, because it also filtered out 'appxxle'.

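I have never fully got to grips with regular expressions. I have been trying to modify code such as grepl("^(?!x).*$", df1$fruit, perl = TRUE) so that it works inside the filter command, but I am not quite getting it.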

Expected output:

#      fruit group
#1     apple     A
#2    orange     B
#3   banxana     A
#4   appxxle     B

If possible, I would like to do this within dplyr.

- jalapic

1 Answer

- Chase · Accepted Answer

I didn't understand what exactly was going on with your second regex, but this more basic regex seems to do the trick:

df1 %>% filter(!grepl("^x|xx$", fruit))
###
    fruit group
1   apple     A
2  orange     B
3 banxana     A
4 appxxle     B

I assume you know this already, but you don't need dplyr here at all:

df1[!grepl("^x|xx$", df1$fruit), ]
###
    fruit group
1   apple     A
2  orange     B
7 banxana     A
8 appxxle     B

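The regex looks for strings that start with x OR end with xx. ^ and $ are regex anchors for the beginning and the end of the string, respectively. | is the OR operator. We negate the result of grepl with ! so that we keep the strings that do not match the regex.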
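
To make the explanation above concrete, here is a minimal sketch that splits the pattern into its two halves and checks that the alternation behaves like a plain logical OR. The variable names (starts_x, ends_xx, combined, no_regex, kept) are made up for illustration, and the startsWith()/endsWith() variant is an alternative I am suggesting rather than part of the original solution. It assumes base R 3.3.0 or later.

```r
# Same sample values as the fruit column in the question.
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

starts_x <- grepl("^x", fruit)   # TRUE where the string begins with "x"
ends_xx  <- grepl("xx$", fruit)  # TRUE where the string ends with "xx"

# The alternation "^x|xx$" gives the same result as OR-ing the two tests.
combined <- grepl("^x|xx$", fruit)
stopifnot(identical(combined, starts_x | ends_xx))

# Regex-free alternative (illustrative): startsWith() / endsWith().
no_regex <- startsWith(fruit, "x") | endsWith(fruit, "xx")
stopifnot(identical(combined, no_regex))

# The negative lookahead from the question keeps strings NOT starting with "x",
# so it is the complement of starts_x (it needs perl = TRUE).
stopifnot(identical(grepl("^(?!x).*$", fruit, perl = TRUE), !starts_x))

# Negate the combined test to keep only the wanted values.
kept <- fruit[!combined]
kept
```

Inside dplyr the same idea would read df1 %>% filter(!startsWith(fruit, "x"), !endsWith(fruit, "xx")). Note that startsWith() and endsWith() only accept character vectors, so if fruit was created as a factor (the data.frame default before R 4.0.0) it would need as.character(fruit) first; grepl does that conversion itself.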