Remove SQL comments in R with a regular expression while preserving special tags

Question

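I need to remove comments that start with -- from strings of SQL statements in R. I have been trying to do this with regular expressions and gsub (though I am open to other suggestions). The complication is that the strings may contain special tags that start with --->>> and end with <<<---, which I need to preserve for further processing.

Using lookahead and lookbehind, I have made some progress on comments/tags that begin at the start of the line: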

> re <- "^(?!=<<<-){1}--(?!->>>){1}.*$"
>
> gsub(re, "", "-- test", perl=TRUE)      # should be ""
[1] ""
> gsub(re, "", "--->>> test", perl=TRUE)   # should be "--->>> test"
[1] "--->>> test"
> gsub(re, "", "<<<--- test", perl=TRUE)   # should be "<<<--- test"
[1] "<<<--- test"
> gsub(re, "", "--->>>->>> test", perl=TRUE) # should be --->>>->>> test
[1] "--->>>->>> test"
> gsub(re, "", "---->>> test", perl=TRUE)    # should be ""
[1] ""
> gsub(re, "", "test --->>> test", perl=TRUE) # should be "test --->>> test"
[1] "test --->>> test"
> gsub(re, "", "test --->>> test <<<---", perl=TRUE) # should be "test --->>> test <<<---"
[1] "test --->>> test <<<---"

But obviously this does not handle comments elsewhere in the string:

> gsub(re, "", "test1 -- test", perl=TRUE) # should be "test1"
[1] "test1 -- test"  # WRONG

Removing the ^ from the start of the regular expression breaks most of the test cases:

> re <- "(?!=<<<-){1}--(?!->>>){1}.*$"
> gsub(re, "", "-- test", perl=TRUE)      # should be ""
[1] ""
> gsub(re, "", "test1 -- test", perl=TRUE) # should be "test1"
[1] "test1 "
> gsub(re, "", "--->>> test", perl=TRUE)   # should be "--->>> test"
[1] "-"  # WRONG
> gsub(re, "", "<<<--- test", perl=TRUE)   # should be "<<<--- test"
[1] "<<<"  # WRONG
> gsub(re, "", "--->>>->>> test", perl=TRUE) # should be --->>>->>> test
[1] "-"  # WRONG
> gsub(re, "", "---->>> test", perl=TRUE)    # should be ""
[1] ""
> gsub(re, "", "test --->>> test", perl=TRUE) # should be "test --->>> test"
[1] "test -"  # WRONG
> gsub(re, "", "test --->>> test <<<---", perl=TRUE) # should be "test --->>> test <<<---"
[1] "test -"  # WRONG

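Does anyone have suggestions on how to achieve this? I am open to any approach, but it must use R, and the special tags --->>> and <<<--- must be preserved.

Edit

As noted in the comments, this is also a test case: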

> gsub(re, "", "-->>> test", perl=TRUE)    # should be ""

- Jason Morgan

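Why is "--->>>->>> test" left alone while "---->>> test" should be removed? - Avinash Raj

@AvinashRaj Because the first case has three "-" while the second has four. It is a somewhat odd case, so I won't worry too much if I can't fix it. - Jason Morgan

Try "(?<!-)--->>>.*?<<<---(?!-)(*SKIP)(*F)|(?<!-)--(?!-).*". Are the tags always on one line, or can there be line breaks? - Wiktor Stribiżew

(?:^|\w)\s*--\s*\b.*|\s*-{4,}.* - Avinash Raj

Sorry, I thought you had both the opening and closing parts in the same string. Try (?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL)|--.* instead. However, at this point I doubt it will help you much. - Wiktor Stribiżew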

2 Answers

This should do the job:

(?:(?<=[^-<]|^)--(?=[^->])|-{4,}).*|(?<!-)-->>>.*

Demo on regex101

- Thomas Ayoub

@WiktorStribiżew They should be. - Jason Morgan

Unless you format the regex and explain what it matches, it will look ugly. - Wiktor Stribiżew

- Wiktor Stribiżew · Accepted Answer

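I am posting the expression I shared in the comments since it turned out to be helpful. The idea is that we match the specific substrings we want to protect and discard them from the match with the (*SKIP)(*FAIL) verbs, and then use the part of the pattern after the verbs to match what we actually want to remove.

Use: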

(?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL)|--.*

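This matches the following:

- (?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL) - one of two sequences:
  - (?<!-)--->>> - --->>> not preceded by -
  - | - or
  - <<<---(?!-) - <<<--- not followed by -
  - (*SKIP)(*FAIL) - discard what has been matched so far and go on to look for the next match
- | - or
- --.* - two hyphens followed by zero or more characters other than a newline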
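
As a rough sketch of how this pattern could be applied from R, the snippet below runs it through gsub over the test strings from the question. The helper name strip_sql_comments is made up for illustration and is not part of the original answer; perl = TRUE is needed because the lookarounds and the (*SKIP)(*FAIL) verbs are PCRE features.

```r
# Pattern from the answer: protect the two tags, then match "--" comments.
re <- "(?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL)|--.*"

# Hypothetical helper (name chosen for this sketch only).
strip_sql_comments <- function(x) {
  # gsub is vectorised over x; perl = TRUE switches to the PCRE engine.
  gsub(re, "", x, perl = TRUE)
}

# Test strings taken from the question.
cases <- c(
  "-- test",
  "--->>> test",
  "<<<--- test",
  "--->>>->>> test",
  "---->>> test",
  "test --->>> test",
  "test --->>> test <<<---",
  "test1 -- test",
  "-->>> test"
)

result <- strip_sql_comments(cases)
print(result)
```

Whitespace in front of a removed comment is not consumed by the pattern, so "test1 -- test" comes back as "test1 " with a trailing space. Wrapping the call in trimws() is one way to tidy that up, if trailing whitespace matters for the later processing.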

See the regex demo on regex101.com.

Note that a dedicated parser would be safer.
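
To illustrate why a parser may be safer, here is a small sketch of one case the pattern cannot handle: a -- that appears inside a quoted string literal. The SQL statement is invented for this example and does not come from the question.

```r
re <- "(?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL)|--.*"

# Made-up statement: the "--" sits inside a string literal, not a comment.
sql <- "SELECT '--x' AS a FROM t"

# The pattern has no notion of quoting, so it cuts the line at the "--".
stripped <- gsub(re, "", sql, perl = TRUE)
print(stripped)
```

The result is truncated in the middle of the literal, which is the kind of input where a tokenizer that tracks quoting would behave differently from a single regular expression.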