I have text containing lots of special characters, from which I want to extract certain substrings:
y <- c("some stuff <rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep> some stuff <#> <dir> where is Londonderry?</dir>",
"some stuff <rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>",
"<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa> blah blub <icu> Yeah </icu>")
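I want to extract the substrings enclosed in matching tags, tags included (for example <rep> ...</rep>).
I have had some success with this regex: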
library(stringr)
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>(?!<\\1>).*</\\1>")), collapse = ", "))
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep> <#> potentially more stuff <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
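Only [[2]] is not as desired: it contains unwanted material (namely <#> potentially more stuff), and the two <rep> ...</rep> substrings are not separated by ", ". I guess my regex fails here because the two tags are the same rather than different.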
How can the regex be improved to get this expected result?
Expected result:
[[1]]
[1] "<rep> I 1knOw 2LondondErry is bigger than 2LIsburn% </rep>, <dir> where is Londonderry?</dir>"
[[2]]
[1] "<rep> <[> But it 's 1nOt an overflow of Belfast% </rep>, <rep> I 1lIved in Lisburn </rep>"
[[3]]
[1] "<xpa> <[> <unclear> 3 sylls </unclear> </[> </{> </xpa>, <icu> Yeah </icu>"
EDIT:
In the meantime, I have found a working solution:
lapply(y, function(x) paste0(unlist(str_extract_all(x, "<([a-z]{3})>.*?</\\1>")), collapse = ", "))
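To see why the lazy quantifier makes the difference, here is a minimal sketch on a made-up string (the names s, greedy and lazy are mine and not part of the original data). It suggests that the first attempt failed not because the two tags are the same, but because the greedy .* runs on to the last closing tag that satisfies the backreference, whereas .*? stops at the first one.

```r
library(stringr)

# Hypothetical toy input: two pairs of the same tag with filler in between
s <- "a <rep> one </rep> filler <rep> two </rep> b"

# Greedy .* extends to the LAST closing tag that matches \\1
greedy <- str_extract_all(s, "<([a-z]{3})>.*</\\1>")[[1]]

# Lazy .*? stops at the FIRST closing tag that matches \\1
lazy <- str_extract_all(s, "<([a-z]{3})>.*?</\\1>")[[1]]

print(greedy)
# [1] "<rep> one </rep> filler <rep> two </rep>"
print(lazy)
# [1] "<rep> one </rep>" "<rep> two </rep>"
```

The lookahead (?!<\\1>) from the first attempt is left out here. As far as I can tell it only inspects the characters immediately after the opening tag, so it does not limit how far .* runs.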
[A-Za-z0-9_], [a-z] will do. - Chris Ruehlemann