你所观察到的现象是由于PCRE子模式递归是原子性的,与Perl不同。实际上
手册详细介绍了这个问题。
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure.
This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example,
"a"
, "aba"
, "abcba"
, "abcdcba"
):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the pattern is longer than three characters.
Consider the subject string "abcba"
:
At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alternative
is taken and the recursion kicks in. The recursive call to subpattern 1
successfully matches the next character ("b"
). (Note that the beginning
and end of line tests are not part of the recursion).
Back at the top level, the next character ("c"
) is compared with what
subpattern 2 matched, which was "a"
. This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to re-
enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot use.
To change the pattern so that matches all palindromic strings, not just
those with an odd number of characters, it is tempting to change the
pattern to this:
^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as alternatives
at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
WARNING!!!
The palindrome-matching patterns above work only if the subject string does not start with a palindrome that is shorter than the
entire string. For example, although "abcba"
is correctly matched, if
the subject is "ababa"
, PCRE finds the palindrome "aba"
at the start,
then fails at top level because the end of the string does not follow.
Once again, it cannot jump back into the recursion to try other alternatives,
so the entire match fails.
其他参考资料
仔细观察模式
原子性论点是正确的,但也许不太明显它如何解释模式为什么会表现出观察到的行为。让我们仔细观察并看看这些如何契合:
我们将使用第一个模式:
^(([a-z])(?1)\2|[a-z]?)$
我将使用以下符号来表示递归:
1
表示该字符被捕获到第一次备选项的第二组中
2
表示该字符被第二个备选项匹配
- 或者如果
2
不在一个字符上方,则执行零重复选项 ?
\
表示该字符由对第一备选项中第二组的反向引用匹配
_
表示递归分支的底部
现在我们以
"aaa"
作为输入进行考虑:
_
1 1 1 2
a a a # This is the first bottom of the recursion,
# now we go back to the third 1 and try to match \.
# This fails, so the third 1 becomes 2.
_
1 1 2
a a a # Now we go back to the second 1 and try to match \.
# This fails, so the second 1 becomes 2.
_
1 2
a a a # The second level matched! now we go back to the first level...
_____
1 2 \
a a a # Now the first 1 can match \, and entire pattern matches!!
现在考虑字符串
"aaaaa"
:
_
1 1 1 1 1 2
a a a a a
_
1 1 1 1 2
a a a a a
_____
1 1 1 2 /
a a a a a
_____
1 2 \
a a a a a
请注意,一旦递归级别与第一个替代项匹配,将来将不会尝试第二个替代项(即使这样做可能会导致匹配),因为PCRE子模式递归是原子的。
现在考虑"aa"
:
_
1 1 2
a a
_
1 2
a a # The second level matched by taking the one repetition option on ?.
# We now go back to the first level, and we can't match \.
# Since PCRE recursion is atomic, we can't go back to the second level
# to try the zero repetition option on ?.
_
2
a a # Anchors don't match, trying zero option on ? also doesn't help,
# so the pattern fails to match!
请注意,一旦递归级别在第二个替代方案上匹配了一个
?
的重复,将来将不会尝试零重复选项(即使这样做可能会导致匹配),因为PCRE子模式递归是原子的。
现在让我们考虑
aaaaaaa
。
_
1 1 1 1 1 1 1 2
a a a a a a a
_
1 1 1 1 1 1 2
a a a a a a a
_____
1 1 1 1 1 2 \
a a a a a a a
_____
1 1 1 2 \
a a a a a a a
_________
1 1 1 2 \ \
a a a a a a a
_____________
1 1 1 2 \ \ \
a a a a a a a
请注意,尽管PCRE子模式递归是原子性的,但它仍然可以成功匹配由字符重复2
n-1次组成的回文字符串。
现在,只是为了好玩,让我们尝试一下"abcba":
_
1 1 1 1 1 2
a b c b a
_
1 1 1 1 2
a b c b a
1 1 1 2
a b c b a # Third level attempts \, but c does not match a!
# So we go back to third 1 and try 2.
_____
1 1 2 \
a b c b a
_________
1 1 2 \ \
a b c b a # Entire pattern is a match!
也就是说,这个模式不仅匹配“只有当字符重复 2
n-1 次时才匹配”。它确实可以匹配
"abcba"
(如
ideone.com 上所见)。但是,它不能匹配
"ababa"
,也不能匹配
"aaaaa"
(请参阅 man 页面上的警告!),因为 PCRE 中的子模式递归是原子的。
您可以将此跟踪过程应用于解释模式在任何输入上的行为。