Extract a substring between two words from a string

Question

7

I have the following string:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

I would like to extract the string between the two <body> tags. The result I am looking for is:

substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"

Note that the substring between the two <body> tags can contain letters, numbers, punctuation, and special characters.

Is there an easy way to do this?

- Mayou

Maybe this: <body>[\S\s]*<body>. - user557597

4 Answers

6

regex = '<body>.+?<body>'

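You need the non-greedy form (.+?) so that the match ends at the next <body> tag instead of running on through as many <body> tags as possible.

If you are using regex alone, with no helper function, you need a capture group to extract the required content, i.e.: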

regex = '(<body>.+?<body>)'
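A minimal sketch of the difference between the greedy and non-greedy forms. The answer does not say which function applies the pattern, so base R's regexpr(), regexec() and regmatches() are assumed here, and the input s is a hypothetical string with two tagged spans:

```r
# Hypothetical input with two tagged spans, to make the difference visible.
s <- "aa<body>first<body>bb<body>second<body>cc"

# Greedy: .+ runs on to the last <body> in the string.
greedy <- regmatches(s, regexpr("<body>.+<body>", s))

# Non-greedy: .+? stops at the first closing <body>.
lazy <- regmatches(s, regexpr("<body>.+?<body>", s))

# Capture group: regexec() returns the full match first, then each group.
captured <- regmatches(s, regexec("(<body>.+?<body>)", s))[[1]][2]

print(greedy)
print(lazy)
print(captured)
```

With only one pair of tags, as in the question, both forms return the same match.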

- Steve P.

2

The strsplit() function can help here:

> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk"         "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"
> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"

Of course, this gives you all three parts of the string and does not include the tags.
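If the tags are needed as well, one possible workaround is to paste them back around the middle piece. This is only a sketch: the names tag, parts and result are illustrative, and fixed = TRUE is an assumption made here so that the tag is treated as a literal string rather than a regex.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
tag <- "<body>"

# Split on the literal tag; the second piece is the text between the tags.
parts <- strsplit(string, tag, fixed = TRUE)[[1]]

# Re-attach the tag on both sides of the middle piece.
result <- paste0(tag, parts[2], tag)
print(result)
```

This assumes the string contains exactly one pair of tags, so the wanted text is always the second piece.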

- Stu

Thank you very much. But your solution leaves the body tags out. I would like them returned as well. - Mayou

0

I believe Matthew's and Steve's answers are both acceptable. Here is another solution:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

regmatches(string, regexpr('<body>.+<body>', string))

output = sub(".*(<body>.+<body>).*", "\\1", string)

print(output)

- Gardener

- Matthew Plourde · Accepted Answer

7

Here is the regex approach:

regmatches(string, regexpr('<body>.+<body>', string))

- Matthew Plourde

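Why do you need perl = TRUE here? - TheComeOnMan

@Codoremifa, you don't, thanks. Originally I thought the OP wanted to exclude the tags, so I suggested lookahead assertions, which require the perl=TRUE flag. - Matthew Plourde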
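For reference, a sketch of what the tag-excluding variant might look like. The exact pattern from the earlier suggestion is not shown in the thread, so the one below is an assumption: it pairs a lookbehind for the opening tag with a lookahead for the closing one, and both need perl = TRUE.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# (?<=<body>) requires the match to follow a tag without consuming it;
# (?=<body>) requires a tag right after the match, also without consuming it.
inner <- regmatches(string,
                    regexpr("(?<=<body>).+?(?=<body>)", string, perl = TRUE))
print(inner)
```

Because the lookarounds consume nothing, the returned match contains only the text between the tags.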

1

One advantage of perl=TRUE is that it is faster. - Arun

@Arun No kidding, thanks. I didn't know that. - Matthew Plourde