How to extract a substring by inverse pattern using R?

Question

6

I am trying to extract a substring by pattern using the R function gsub().

# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)
# [1] "Psychologist -  on the website, online"

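As you can see, it is easy to exclude the needed substring with gsub(), but I need to invert the result and get only "7 years". I was thinking about using "^", something like this: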

gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)

Could anyone please help me with the correct regex pattern?

- Michael

Guys, could you explain why "\\1" is used in 'replacement="\\1"'? - Michael

2 Answers

4

You can use the opposite of \d, which is \D in R:

string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"

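\D* matches as many non-digit characters as possible. The digits, whitespace and word that follow are captured in group 1, .* consumes the rest of the string, and the whole match is then replaced with the contents of that group.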
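To make that explanation concrete, here is a minimal sketch that looks at each piece separately. The variable names (prefix, kept, no_match) and the "no digits here" input are made up for illustration and are not part of the original answer. The last call shows a caveat worth keeping in mind: when the pattern does not match at all, sub() returns the input unchanged rather than an empty string.

```r
string <- "Psychologist - 7 years on the website, online"

# What the leading \\D* consumes: everything before the first digit.
prefix <- sub(pattern = "^(\\D*).*", replacement = "\\1", string)

# The full pattern from the answer keeps only capture group 1.
kept <- sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)

# Hypothetical input without digits: the pattern cannot match,
# so sub() leaves the string as it is.
no_match <- sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1",
                "no digits here")

print(prefix)    # "Psychologist - "
print(kept)      # "7 years"
print(no_match)  # "no digits here"
```

If an unmatched input should yield NA or an empty string instead, that has to be checked separately, for example with grepl() before calling sub().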

See a demo on regex101.com.

- Jan

Thanks. Nice solution. - Michael

- Wiktor Stribiżew · Accepted Answer

You may use

sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)

See this R demo.

Details

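.*? - any 0 or more characters, as few as possible
([0-9]+\\s+\\w+) - capturing group 1:
- [0-9]+ - one or more digits
- \\s+ - 1 or more whitespace characters
- \\w+ - 1 or more word characters
.* - the rest of the string (any 0 or more characters, as many as possible)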
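The difference between the lazy .*? and a greedy .* only shows up when the number has more than one digit. The sketch below uses a made-up input with "17 years" and passes perl = TRUE so that the behaviour follows the PCRE rules; both the input and the switch to PCRE are assumptions for illustration, not part of the original answer.

```r
# Hypothetical input with a two-digit number.
x <- "Psychologist - 17 years on the website, online"

# Lazy prefix: stops as early as possible, so the group gets all digits.
lazy <- sub(".*?([0-9]+\\s+\\w+).*", "\\1", x, perl = TRUE)

# Greedy prefix: swallows as much as it can and gives back only what the
# group strictly needs, which leaves just the last digit for [0-9]+.
greedy <- sub(".*([0-9]+\\s+\\w+).*", "\\1", x, perl = TRUE)

print(lazy)    # "17 years"
print(greedy)  # "7 years"
```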

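The \1 in the replacement (written "\\1" inside an R string literal) is replaced with the contents of capturing group 1.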
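As a small illustration of how backreferences map to groups, the sketch below splits the pattern into two capturing groups and swaps them in the replacement. It then shows a different technique that neither answer uses: base R's regmatches() with regexpr() or gregexpr(), which returns the matched text directly instead of deleting everything around it. The two-group pattern, the variable names and the "3 apples and 12 oranges" input are assumptions made for this example.

```r
string <- "Psychologist - 7 years on the website, online"

# Two groups: \\1 is the number, \\2 is the word after it.
swapped <- sub(".*?([0-9]+)\\s+(\\w+).*", "\\2: \\1", string, perl = TRUE)

# Alternative technique: extract the first match directly.
direct <- regmatches(string, regexpr("[0-9]+\\s+\\w+", string))

# gregexpr() finds every match; regmatches() then returns a list
# with one character vector per input string.
many <- "3 apples and 12 oranges"
all_hits <- regmatches(many, gregexpr("[0-9]+\\s+\\w+", many))[[1]]

print(swapped)   # "years: 7"
print(direct)    # "7 years"
print(all_hits)  # "3 apples"   "12 oranges"
```

Unlike sub(), regmatches() with regexpr() returns character(0) when nothing matches, so the no-match case is easy to detect.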