为什么这个递归正则表达式只有在一个字符重复2^n-1次时才匹配?

6
阅读了polygenelubricants的一系列关于高级正则表达式技巧的文章(尤其是这个Java正则表达式如何检测回文?),我决定尝试使用递归(在PHP中)创建自己的PCRE正则表达式来解析回文。
我想到的是:
^(([a-z])(?1)\2|[a-z]?)$

我的理解是,这个表达式应该匹配零个或一个字符(小于2个字符的每个字符串都隐含为回文,并且在递归中考虑奇数长度的回文),或者两个相同字符由模式的递归分隔。不幸的是,它似乎并没有按照这种方式工作,正如您在www.ideone.com/a9T3F上看到的那样。相反,只有2 n - 1(即空字符串、aaaaaaaaaaaa15)重复字符与正则表达式匹配。
奇怪的是,如果我修改模式,使递归变成可选项(即^(([a-z])(?1)?\2|[a-z]?)$,请参见www.ideone.com/D6lJR),它只匹配字符重复2n次的字符串(即空字符串,aaaaaaaaaaaaaaaa16)。

为什么我的正则表达式没有按照我的期望工作呢?

对于那些迫不及待想要建议不要使用正则表达式的人:
本问题的重点是学习如何正确使用递归正则表达式。我知道这不是确定一个字符串是否为回文的有效方法,如果我不得不在生产代码中确定回文,我也不会使用递归正则表达式;我只是对学习正则表达式的高级方面感兴趣。

2个回答

9
你所观察到的现象是由于PCRE子模式递归是原子性的,与Perl不同。实际上手册详细介绍了这个问题。

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):

    ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE it does not if the pattern is longer than three characters.

Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at the end of the string, the first alternative fails; the second alternative is taken and the recursion kicks in. The recursive call to subpattern 1 successfully matches the next character ("b"). (Note that the beginning and end of line tests are not part of the recursion).

Back at the top level, the next character ("c") is compared with what subpattern 2 matched, which was "a". This fails. Because the recursion is treated as an atomic group, there are now no backtracking points, and so the entire match fails. (Perl is able, at this point, to re- enter the recursion and try the second alternative.) However, if the pattern is written with the alternatives in the other order, things are different:

    ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to recurse until it runs out of characters, at which point the recursion fails. But this time we do have another alternative to try at the higher level. That is the big difference: in the previous case the remaining alternative is at a deeper recursion level, which PCRE cannot use.

To change the pattern so that matches all palindromic strings, not just those with an odd number of characters, it is tempting to change the pattern to this:

    ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason. When a deeper recursion has matched a single character, it cannot be entered again in order to match an empty string. The solution is to separate the two cases, and write out the odd and even cases as alternatives at the higher level:

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

WARNING!!!

The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the entire string. For example, although "abcba" is correctly matched, if the subject is "ababa", PCRE finds the palindrome "aba" at the start, then fails at top level because the end of the string does not follow. Once again, it cannot jump back into the recursion to try other alternatives, so the entire match fails.

其他参考资料


仔细观察模式

原子性论点是正确的,但也许不太明显它如何解释模式为什么会表现出观察到的行为。让我们仔细观察并看看这些如何契合:

我们将使用第一个模式:

^(([a-z])(?1)\2|[a-z]?)$

我将使用以下符号来表示递归:
  • 1 表示该字符被捕获到第一次备选项的第二组中
  • 2 表示该字符被第二个备选项匹配
    • 或者如果 2 不在一个字符上方,则执行零重复选项 ?
  • \ 表示该字符由对第一备选项中第二组的反向引用匹配
  • _ 表示递归分支的底部
    • 即使有其他备选项,该分支也不会重新进入!
现在我们以 "aaa" 作为输入进行考虑:
      _
1 1 1 2 
a a a   # This is the first bottom of the recursion,
        # now we go back to the third 1 and try to match \.
        # This fails, so the third 1 becomes 2.
    _
1 1 2
a a a   # Now we go back to the second 1 and try to match \.
        # This fails, so the second 1 becomes 2.
  _
1 2
a a a   # The second level matched! now we go back to the first level...

_____
1 2 \
a a a   # Now the first 1 can match \, and entire pattern matches!!

现在考虑字符串 "aaaaa"
          _
1 1 1 1 1 2
a a a a a  # Fifth 1 can't match \, so it becomes 2. 
        _
1 1 1 1 2
a a a a a  # Fourth 1 can't match \, so it becomes 2.
    _____
1 1 1 2 /
a a a a a  # Here's a crucial point. The third 1 successfully matched.
           # Now we're back to the second 1 and try to match \, but this fails.
           # However, since PCRE recursion is atomic, the third 1 will NOT be
           # reentered to try 2. Instead, we try 2 on the second 1.
_____
1 2 \
a a a a a  # Anchors don't match, so the first 1 becomes 2, and then also the
           # anchors don't match, so the pattern fails to match.

请注意,一旦递归级别与第一个替代项匹配,将来将不会尝试第二个替代项(即使这样做可能会导致匹配),因为PCRE子模式递归是原子的。


现在考虑"aa"

    _
1 1 2 
a a
  _
1 2
a a  # The second level matched by taking the one repetition option on ?.
     # We now go back to the first level, and we can't match \.
     # Since PCRE recursion is atomic, we can't go back to the second level
     # to try the zero repetition option on ?.
_    
2
a a  # Anchors don't match, trying zero option on ? also doesn't help,
     # so the pattern fails to match!

请注意,一旦递归级别在第二个替代方案上匹配了一个?的重复,将来将不会尝试零重复选项(即使这样做可能会导致匹配),因为PCRE子模式递归是原子的。
现在让我们考虑aaaaaaa
              _
1 1 1 1 1 1 1 2  
a a a a a a a 
            _
1 1 1 1 1 1 2  
a a a a a a a 
        _____
1 1 1 1 1 2 \  
a a a a a a a  # A crucial point: the fifth level matched and now the fourth
               # level can't match \, but it does NOT reenter the fifth level to
               # try 2. Instead, the fourth level tries 2.
    _____    
1 1 1 2 \  
a a a a a a a 
  _________    
1 1 1 2 \ \ 
a a a a a a a 
_____________    
1 1 1 2 \ \ \
a a a a a a a  # Entire pattern is a match! 

请注意,尽管PCRE子模式递归是原子性的,但它仍然可以成功匹配由字符重复2n-1次组成的回文字符串。
现在,只是为了好玩,让我们尝试一下"abcba":
          _
1 1 1 1 1 2
a b c b a
        _
1 1 1 1 2
a b c b a

1 1 1 2 
a b c b a   # Third level attempts \, but c does not match a!
            # So we go back to third 1 and try 2.
  _____
1 1 2 \ 
a b c b a 
_________
1 1 2 \ \
a b c b a   # Entire pattern is a match!

也就是说,这个模式不仅匹配“只有当字符重复 2n-1 次时才匹配”。它确实可以匹配 "abcba"(如ideone.com 上所见)。但是,它不能匹配 "ababa",也不能匹配 "aaaaa"(请参阅 man 页面上的警告!),因为 PCRE 中的子模式递归是原子的。
您可以将此跟踪过程应用于解释模式在任何输入上的行为。

你总是附带免费电子书来解答问题吗?解释得非常好!;) - Caspar Kleijne
非常感谢您提供这个出色的答案!在看到您的帖子之前,我认为自己对正则表达式已经了解很多了 ;) - Daniel Vandersluis
1
@Daniel:请参见https://dev59.com/MW865IYBdhLWcg3wkfgW。 - polygenelubricants

2

如果你需要一个完全可用的PCRE表达式来匹配回文,你可以使用以下表达式:

/^(?:(.)(?=.*(\1(?(2)\2))$))*+.?\2?$/


注意:请勿删除HTML标签。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接