Extracting text between variable delimiters

3

I have text that contains lots of special characters, and I'd like to extract certain substrings from it:

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

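I'd like to extract what lies between matching opening and closing tags, such as the substrings <rep> ... </rep> or <dir> ... </dir> or <icu> ... </icu>, etc.

I've had some success with this regex: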

library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

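Only [[2]] is not as expected: it still contains unwanted material (namely "<#> potentially more stuff"), and the two <rep> ... </rep> substrings are not separated by ", ". My guess is that the regex fails here because the two tags are identical rather than different.

How can the regex be improved to obtain the expected result?

Expected result: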

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"

EDIT:

In the meantime, I have found a working solution myself:

lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))

3 Answers

4

How about this?

unlist(str_extract_all(y, "\\<([A-Za-z0-9_]+\\>).*?(\\<\\/\\1)"))

# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>" "<dir> where is Londonderry?</dir>"                         
# [3] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>"    "<rep> I 1lIved in Lisburn </rep>"                          
# [5] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>"    "<icu> Yeah </icu>"     

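Basically, all we do is put the body of the (opening) tag, plus its trailing angle bracket, into a capture group, and then use that capture group to define the corresponding closing tag. Everything between the opening tag and that closing tag is then matched lazily. So something like this: <(tag>)whatever</\\1, where \\1 stands for tag>.

To collapse the matches per input string:

lapply(str_extract_all(y, "\\<([A-Za-z0-9]+)\\>.*?\\<\\/\\1\\>"), paste, collapse = ", ")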

# [[1]]
# [1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

# [[2]]
# [1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

# [[3]]
# [1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
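
As a minimal sketch of why the lazy quantifier is what makes the difference, here is the greedy and the lazy pattern side by side. It uses base R only; the shortened input string and the helper name `extract_all` are made up for illustration and are not part of the original post.

```r
# Hypothetical input, shortened from the second string in the question
x <- "a <rep> one </rep> <#> b <rep> two </rep>"

# Illustrative helper: all matches of a PCRE pattern in a single string
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Greedy .* runs to the LAST closing tag that equals the captured tag name
greedy <- extract_all("<([a-z]{3})>.*</\\1>", x)

# Lazy .*? stops at the FIRST closing tag that equals the captured tag name
lazy <- extract_all("<([a-z]{3})>.*?</\\1>", x)

greedy
# [1] "<rep> one </rep> <#> b <rep> two </rep>"
lazy
# [1] "<rep> one </rep>" "<rep> two </rep>"
```

The backreference is identical in both patterns; only the quantifier changes, which is why two identical tags in one string are split correctly in the lazy version.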

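Thanks. The novelty of your solution lies not in the use of the backreference, which you've explained at length (it is used in exactly the same way in the OP), but in the use of the lazy dot. Also, there is no need for the comprehensive character class [A-Za-z0-9_]; [a-z] will do. - Chris Ruehlemann
@ChrisRuehlemann, please see my edit to the answer. - Dunois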
1
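I've seen it, but for me to accept it you would need to edit your answer further. Also: the edited solution looks very much like my own (earlier) edited solution. - Chris Ruehlemann
@ChrisRuehlemann Well, there are only so many ways to unlist and collapse a bunch of strings. For what it's worth, I did not base this answer on the edit in the OP (which I only noticed just now). If you don't "accept" the answer, that's fine; I'm not doing this for points/reputation/whatever. I think I helped you to some degree (with the lazy regex quantifier), and I learned something along the way, so that's good enough for me. - Dunois
Believe it or not, the idea of using a lazy quantifier occurred to me shortly after I posted the question. Anyway, I'll accept your answer so that you benefit from it (and I'm happy even with two points less). - Chris Ruehlemann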

4
library(gsubfn)
a1 <- strapplyc(y, "<dir>(.*?)</dir>", simplify = c)
a2 <- strapplyc(y, "<rep>(.*?)</rep>", simplify = c)
a3 <- strapplyc(y, "<icu>(.*?)</icu>", simplify = c)

a1
a2
a3

# output:
> a1
[1] " where is Londonderry?"
> a2
[1] " I 1knOw 2LondondErry is bigger than 2LIsburn% " " <[> But it 's 1nOt an overflow of Belfast% "   
[3] " I 1lIved in Lisburn "                          
> a3
[1] " Yeah "
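
If only the inner text is wanted but the tag names are not known in advance, the same idea can be sketched without writing one pattern per tag. This is a base-R sketch on a made-up string, assuming three-letter lowercase tag names as in the question; the variable names are illustrative only.

```r
# Hypothetical input with two different three-letter tags
y1 <- "x <rep> one </rep> y <dir> two </dir>"

# Step 1: find every <tag> ... </tag> pair, tying the closing tag to the
# opening one with a backreference and matching lazily
tagged <- regmatches(y1, gregexpr("<([a-z]{3})>.*?</\\1>", y1, perl = TRUE))[[1]]

# Step 2: strip the surrounding tags, keeping only what is in between
inner <- sub("^<[a-z]{3}>(.*)</[a-z]{3}>$", "\\1", tagged)

inner
# [1] " one " " two "
```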

1
Thanks for your effort, but that's not quite the expected output. - Chris Ruehlemann
Thanks for your feedback! - TarJae

4

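If I understood your question correctly, this is a possible solution (I often use the rebus package for regex-related problems; the result is an ordinary regular expression):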

library(dplyr)
library(rebus)
library(stringi)
library(purrr)

y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
       "some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
       "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")

pattern <- "<" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">" %R% ".*?" %R% "<" %R% "/" %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% ">" 

stringi::stri_extract_all_regex(y ,pattern, simplify = FALSE) %>% 
  purrr::map(~paste0(.x, collapse = ", "))

[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"

[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"

[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
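
One caveat may be worth sketching. Assuming the rebus pattern above expands to `<...>.*?</...>` (that is, `ANY_CHAR` is `.`), nothing ties the closing tag to the opening one, so it can stop at a closing tag with a different name. The base-R comparison below uses a made-up string with deliberately mismatched tags; the variable names are illustrative only.

```r
# Hypothetical input: a </dir> closing tag appears inside a <rep> element
x <- "<rep> a </dir> b </rep>"

# Any three characters as tag name, opening and closing tag not linked
unlinked <- regmatches(x, gregexpr("<...>.*?</...>", x, perl = TRUE))[[1]]

# Same shape, but the closing tag must repeat the captured opening tag name
linked <- regmatches(x, gregexpr("<(...)>.*?</\\1>", x, perl = TRUE))[[1]]

unlinked
# [1] "<rep> a </dir>"
linked
# [1] "<rep> a </dir> b </rep>"
```

On the three strings in the question both variants give the same result, so the difference only matters if mismatched closing tags can occur in the data.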
