If-else in a recursive regular expression not working as expected

Question



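I'm using a regex to parse some BBCode, so the regex has to work recursively in order to match tags inside other tags. Most BBCode tags take an argument, and sometimes it is quoted, though not always.

A simplified equivalent of the regex I'm using (with HTML-style tags to reduce the escaping needed) is this: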

'~<(\")?a(?(1)\1)> #Match the tag, and require a closing quote if an opening one provided
  ([^<]+ | (?R))* #Match the contents of the tag, including recursively
</a>~x'

However, if I have a test string that looks like this:

<"a">Content<a>Also Content</a></a>

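It only matches <a>Also Content</a>. When the engine tries to match from the first tag, the first capture group \1 is set to ", and that value is not overwritten when the regex is run recursively to match the inner tag. The inner tag has no quotes, so the recursive call fails, and with it the whole attempt that started at the first tag.

It works fine if I consistently use quotes or consistently leave them out, but I can't be sure that the content I have to parse will be consistent. Is there any way around this?

The full regex I'm using matches [spoiler]content[/spoiler], [spoiler=option]content[/spoiler] and [spoiler="option"]content[/spoiler]: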

"~\[spoiler\s*+ #Match the opening tag
            (?:=\s*+(\"|\')?((?(1)(?!\\1).|[^\]]){0,100})(?(1)\\1))?+\s*\] #If an option exists, match that
          (?:\ *(?:\n|<br />))?+ #Get rid of an extra new line before the start of the content if necessary
          ((?:[^\[\n]++ #Capture all characters until the closing tag
            |\n(?!\[spoiler]) #Capture new line separately so backtracking doesn't run away due to above
            |\[(?!/?spoiler(?:\s*=[^\]*])?) #Also match all tags that aren't spoilers
            |(?R))*+) #Allow the pattern to recurse - we also want to match spoilers inside spoilers,
                     # without messing up nesting
          \n? #Get rid of an extra new line before the closing tag if necessary
          \[/spoiler] #match the closing tag
         ~xi"

It probably has a few other bugs as well.

- JackW

2 Answers


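(?(1)...) only checks whether group 1 is defined, so once the group has been set for the first time, the condition stays true. That is why you get this result: it has nothing to do with the recursion level as such.

So when the recursion reaches <a>, the regex engine tries to match <a"> and fails.

If you want to keep the conditional, you can write <("?)a(?(1)\1)> instead. That way group 1 is redefined every time, possibly as an empty string.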
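To make the difference concrete, here is a minimal PHP sketch that runs both variants against the test string from the question. The variable names are my own, and the non-capturing group with a possessive quantifier around the tag body is a simplification of the original pattern. The results noted in the comments assume PCRE's behaviour of keeping capture values from the outer level visible inside a recursive call.

```php
<?php
// Test string from the question: a quoted outer tag wrapping an unquoted inner tag.
$subject = '<"a">Content<a>Also Content</a></a>';

// Optional group plus conditional: group 1 is still set to '"' inside the recursion,
// so the unquoted inner tag cannot match there.
$optionalGroup = '~<(")?a(?(1)\1)>(?:[^<]++|(?R))*</a>~';

// Quantifier moved inside the group: group 1 is set again (possibly to '') at every level.
$alwaysSetGroup = '~<("?)a(?(1)\1)>(?:[^<]++|(?R))*</a>~';

preg_match($optionalGroup, $subject, $first);
preg_match($alwaysSetGroup, $subject, $second);

echo $first[0], PHP_EOL;  // only the inner tag: <a>Also Content</a>
echo $second[0], PHP_EOL; // the whole string: <"a">Content<a>Also Content</a></a>
```

With the second pattern the condition is always true, which is why the conditional can also be dropped entirely in favour of a plain \1.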

Obviously, you can write your pattern in a more efficient way, like this:

~<(?:a|"a")>[^<]*+(?:(?R)[^<]*)*+</a>~

For your actual problem, I would use this kind of pattern, which matches any tag:

$pattern = <<<'EOD'
~
\[ (?<tag>\w+) \s*
(?: 
  = \s* 
  (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;

If you want to enforce a specific tag at the ground level (outside any recursion), you can add (?(R)|(?=spoiler\b)) before the tag name.
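As a rough usage sketch, the pattern with that extra check can be applied as follows. The sample string and the variable names `$subject` and `$m` are hypothetical; the pattern itself is the one given in this answer with (?(R)|(?=spoiler\b)) placed before the tag name.

```php
<?php
// Pattern from the answer, plus the (?(R)|(?=spoiler\b)) check before the tag name.
$pattern = <<<'EOD'
~
\[ (?(R)|(?=spoiler\b)) (?<tag>\w+) \s*
(?:
  = \s*
  (?| " (?<option>[^"]*) " | ' ([^']*) ' | ([^]\s]*) ) # branch reset feature
)?
\s* ]
(?<content> [^[]*+ (?: (?R) [^[]*)*+ )
\[/\g{tag}]
~xi
EOD;

// Hypothetical sample: a top-level [b] tag, then a spoiler that contains a [b] tag.
$subject = '[b]x[/b] [spoiler="Plot twist"]a [b]c[/b][/spoiler]';

if (preg_match($pattern, $subject, $m)) {
    echo $m[0], PHP_EOL;         // the spoiler block; the leading [b]x[/b] is skipped
    echo $m['tag'], PHP_EOL;     // spoiler
    echo $m['option'], PHP_EOL;  // Plot twist
    echo $m['content'], PHP_EOL; // a [b]c[/b]
}
```

The lookahead is only tested outside a recursion, so the nested [b] tag is still accepted as part of the spoiler's content.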

- Casimir et Hippolyte


- Lucas Trzesniewski · Accepted Answer

The simplest solution is to use an alternation:

<(?:a|"a")>
  ([^<]++ | (?R))*
</a>

But if you don't want to repeat the a part, you can do the following:

<("?)a\1>
  ([^<]++ | (?R))*
</a>


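I've just moved the ? quantifier inside the capture group. The group now always participates in the match, but what it captures can be empty, so the conditional is no longer necessary.

Side note: I've applied a possessive quantifier to [^<] to avoid catastrophic backtracking.

In your case, I believe it's better to match a generic tag than a specific one. Match all tags, and then decide in your code what to do with each match.

Here's a full regex: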

\[
  (?<tag>\w+) \s*
  (?:=\s*
    (?:
      (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
      | (?<arg>[^\]]+)
    )
  )?
\]

(?<content>
  (?:[^[]++ | (?R) )*+
)

\[/\k<tag>\]


Notice that I added the J option (PCRE_DUPNAMES) to be able to use (?<arg>...) twice.
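One way to "match all tags and decide in code" is sketched below in PHP. The `render` function and the HTML it emits are hypothetical, and the duplicate-names option is switched on inline with (?J) rather than as a trailing modifier. The sketch reads only the `tag` and `content` groups, since how a duplicated name such as `arg` shows up in the match array may depend on the PHP version.

```php
<?php
// Full pattern from the answer; (?J) enables duplicate group names inline.
$pattern = <<<'EOD'
~(?J)
\[
  (?<tag>\w+) \s*
  (?:=\s*
    (?:
      (?<quote>["']) (?<arg>.{0,100}?) \k<quote>
      | (?<arg>[^\]]+)
    )
  )?
\]
(?<content>
  (?:[^[]++ | (?R) )*+
)
\[/\k<tag>\]
~x
EOD;

// Hypothetical renderer: spoilers become a <div>, every other tag is left untouched.
// No HTML escaping is done here; this only illustrates dispatching on the tag name.
function render(string $text, string $pattern): string
{
    $result = preg_replace_callback($pattern, function (array $m) use ($pattern) {
        if (strtolower($m['tag']) !== 'spoiler') {
            return $m[0]; // unknown tag: keep the original text
        }
        // Render nested tags inside the content before wrapping it.
        return '<div class="spoiler">' . render($m['content'], $pattern) . '</div>';
    }, $text);

    return $result ?? $text; // fall back to the input if the regex engine reports an error
}

echo render('[spoiler]a [spoiler="x"]b[/spoiler] c[/spoiler]', $pattern), PHP_EOL;
// <div class="spoiler">a <div class="spoiler">b</div> c</div>
```

Because the regex only finds balanced tag pairs, the decision about which tags are supported lives in ordinary code, where it is easier to extend than inside the pattern.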