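There are two key points here:
- The patterns to be removed from a string may overlap
- There may be multiple non-overlapping patterns to be removed from a string
The solution below tries to address both issues using my favourite tools.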
library(data.table)
# add a row id that the match positions can be joined on later
setDT(dat)[, rn := .I]
library(stringr)
library(magrittr)
# locate all matches of every pattern, then collapse overlapping ranges per row
pos <-
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
           lapply(as.data.table) %>%
           rbindlist(idcol = "rn")) %>%
  rbindlist() %>%
  setorder(rn, start, end) %>%
  .[, grp := cumsum(cummax(shift(end, fill = 0)) < start), by = rn] %>%
  .[, .(start = min(start), end = max(end)), by = .(rn, grp)]
Now, pos has become:
rn grp start end
1: 1 1 6 18
2: 2 1 10 25
3: 3 1 1 13
4: 5 1 6 10
5: 5 2 24 28
6: 6 1 1 13
7: 6 2 15 27
8: 7 1 3 7
9: 8 1 1 10
10: 8 2 12 16
11: 8 3 22 34
12: 9 1 1 10
13: 9 2 19 31
dat[, short_x := x]
for (g in rev(seq_len(max(pos$grp)))) {
  dat[pos[grp == g], on = .(rn), short_x := `str_sub<-`(short_x, start, end, value = "")]
}
dat[, rn := NULL][
  , short_x := str_squish(short_x)][]
x some_other_cols short_x
1: this is my example text 1 this text
2: and here is my other text example 2 and here example
3: my other text is short 2 is short
4: yet another text 4 yet another text
5: this is my text where 'is my' appears twice 5 this text where '' appears twice
6: my other text is my example 6
7: This myself 7 Thself
8: my example is my not my other text 8 not
9: my example is not my other text 9 is not
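The loop above walks through the groups in reverse order, i.e. from right to left within each string, so that removing one range does not shift the positions of the ranges still to be removed. Here is a minimal base R sketch of that idea using row 8 of the data. The helpers cut_range() and squish() are hypothetical names introduced for illustration only; they are not part of the solution above.

```r
# Hypothetical helper: drop the characters start..end from a single string
cut_range <- function(s, start, end) {
  paste0(substr(s, 1, start - 1), substr(s, end + 1, nchar(s)))
}

s      <- "my example is my not my other text"
starts <- c(1, 12, 22)
ends   <- c(10, 16, 34)

# Right to left: the positions of the remaining ranges stay valid
right_to_left <- s
for (i in rev(seq_along(starts))) {
  right_to_left <- cut_range(right_to_left, starts[i], ends[i])
}

# Left to right: later positions no longer fit the shortened string
left_to_right <- s
for (i in seq_along(starts)) {
  left_to_right <- cut_range(left_to_right, starts[i], ends[i])
}

# Base R stand-in for str_squish()
squish <- function(x) gsub("\\s+", " ", trimws(x))

squish(right_to_left)  # "not"
squish(left_to_right)  # something else, because the offsets were stale
```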
The code for collapsing overlapping positions is adapted from this answer.
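To make the collapsing step easier to follow, here is a minimal base R sketch of the same cumsum(cummax(...) < start) idea for the ranges of a single string. The function name collapse_ranges() is an assumption made for this illustration; the actual solution does this per rn inside data.table.

```r
# Hypothetical helper: merge overlapping [start, end] ranges of one string
collapse_ranges <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # largest end among all preceding ranges (0 for the first range)
  prev_max_end <- cummax(c(0, head(end, -1)))
  # a new group begins whenever a range starts after everything seen so far
  grp <- cumsum(prev_max_end < start)
  data.frame(
    grp   = unique(grp),
    start = as.vector(tapply(start, grp, min)),
    end   = as.vector(tapply(end, grp, max))
  )
}

# Row 6, "my other text is my example": matches at 18-27, 15-19 and 1-13
res <- collapse_ranges(start = c(18, 15, 1), end = c(27, 19, 13))
res
```

For this input the sketch yields two ranges, 1 to 13 and 15 to 27, which agrees with the two entries for rn 6 in pos.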
Intermediate result
lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
         lapply(as.data.table) %>%
         rbindlist(idcol = "rn"))
[[1]]
rn start end
1: 1 9 18
2: 6 18 27
3: 8 1 10
4: 9 1 10
[[2]]
rn start end
1: 1 6 10
2: 2 10 14
3: 5 6 10
4: 5 24 28
5: 6 15 19
6: 7 3 7
7: 8 12 16
[[3]]
rn start end
1: 2 13 25
2: 3 1 13
3: 6 1 13
4: 8 22 34
5: 9 19 31
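This shows that patterns 1 and 2 overlap in row 1 and patterns 2 and 3 overlap in row 2. Rows 5, 8, and 9 have non-overlapping patterns. Row 7 is meant to show that patterns are extracted regardless of word boundaries.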
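The word boundary behaviour in row 7 can be reproduced with a short base R sketch. The variant with \\b anchors is only an assumption about what one might do if whole-word matching were wanted; it is not used in the solution above.

```r
# Plain pattern: also matches inside "This myself"
plain <- gsub("is my", "", "This myself")

# Assumed variant: \\b anchors restrict the match to whole words
bounded_inside <- gsub("\\bis my\\b", "", "This myself")
bounded_words  <- gsub("\\bis my\\b", "", "this is my text")

plain           # "Thself"
bounded_inside  # "This myself" (no match, string unchanged)
bounded_words   # "this  text" (two spaces remain until squished)
```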
Edit: dplyr version
The OP has mentioned that he/she has "successfully avoided data.table". So, I felt challenged to add a dplyr version:
library(dplyr)
library(stringr)
pos <-
  lapply(my_patterns, function(pat) str_locate_all(dat$x, pat) %>%
           lapply(as_tibble) %>%
           bind_rows(.id = "rn")) %>%
  bind_rows() %>%
  arrange(rn, start, end) %>%
  group_by(rn) %>%
  mutate(grp = cumsum(cummax(lag(end, default = 0)) < start)) %>%
  group_by(rn, grp) %>%
  summarize(start = min(start), end = max(end))
dat <- dat %>%
  mutate(rn = row_number() %>% as.character(),
         short_x = x %>% as.character())
for (g in rev(seq_len(max(pos$grp)))) {
  dat <- dat %>%
    left_join(pos %>% filter(grp == g), by = "rn") %>%
    mutate(short_x = ifelse(is.na(grp), short_x, `str_sub<-`(short_x, start, end, value = ""))) %>%
    select(-grp, -start, -end)
}
dat %>%
  select(-rn) %>%
  mutate(short_x = str_squish(short_x))
x some_other_cols short_x
1 this is my example text 1 this text
2 and here is my other text example 2 and here example
3 my other text is short 2 is short
4 yet another text 4 yet another text
5 this is my text where 'is my' appears twice 5 this text where '' appears twice
6 my other text is my example 6
7 This myself 7 Thself
8 my example is my not my other text 8 not
9 my example is not my other text 9 is not
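The algorithm is essentially the same. However, there are two challenges here where dplyr differs from data.table:
- dplyr requires explicit coercion from factor to character
- There is no update join available in dplyr, so the for loop has become more verbose than its data.table counterpart (perhaps someone knows a fancy purrr function or a map-reduce trick to accomplish the same?)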
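As a small illustration of the first point: in R versions before 4.0.0, data.frame() turned strings into factors by default, which is why x has to be coerced before any string manipulation. The sketch below sets stringsAsFactors = TRUE explicitly to mimic that old default; the object names dat_f and short_x_chr are made up for this example.

```r
# stringsAsFactors = TRUE mimics the default of R versions before 4.0.0
dat_f <- data.frame(
  x = c("this is my example text", "yet another text"),
  stringsAsFactors = TRUE
)
is.factor(dat_f$x)         # TRUE

# explicit coercion, as done with as.character() in the mutate() call above
short_x_chr <- as.character(dat_f$x)
is.character(short_x_chr)  # TRUE
```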
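Edit 2
The code above contains some bug fixes and improvements:
- Collapsing of positions has been corrected and now handles some edge cases I have added to dat.
- seq() has been replaced by seq_len().
- str_squish() reduces repeated whitespace inside a string and removes whitespace from the start and end of a string.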
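The following base R sketch illustrates the last two points. The edit does not say why seq() was replaced; the zero-length case shown here is only the usual reason for preferring seq_len(). The function squish() is a hypothetical base R stand-in for str_squish().

```r
n <- 0
length(seq(n))      # 2, because seq(0) counts down from 1 to 0
length(seq_len(n))  # 0, so a for loop over it is skipped entirely

# Hypothetical base R stand-in for stringr::str_squish()
squish <- function(x) gsub("\\s+", " ", trimws(x))
squish("  this   text  ")  # "this text"
```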
Data
I have added some use cases to test for non-overlapping patterns and complete removal, e.g.:
dat <- data.frame(
  x = c(
    "this is my example text",
    "and here is my other text example",
    "my other text is short",
    "yet another text",
    "this is my text where 'is my' appears twice",
    "my other text is my example",
    "This myself",
    "my example is my not my other text",
    "my example is not my other text"
  ),
  some_other_cols = c(1, 2, 2, 4, 5, 6, 7, 8, 9)
)
my_patterns <- c("my example", "is my", "my other text")
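this text? If my example is eliminated according to your first pattern, you are left with this is text, which does not contain the word my... so how did is get eliminated? - Sotos
c("is my example", "is my other text", "my example", "is my", "my other text")). - Georgery
matches <- lapply(my_patterns,function(x) regexpr(x,dat$x)), then you need to compute the start and end position of each match (start+length) and check whether they overlap, in order to create a single "range" to remove from the string. - Tensibai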