The ultimate answer to such a question is in the engine's code, and at the bottom of the answer you'll be able to dive into the section of the PCRE engine's code responsible for ensuring fixed-length in lookbehinds—if you're interested in knowing the finest details. In the meantime, let's gradually zoom into the question from higher levels.
Variable-Width Lookbehind vs. Infinite-Width Lookbehind
First off, a quick clarification on terms. A growing number of engines (including PCRE) support some form of variable-width lookbehind, where the variation falls within a determined range, for instance:
- the engine knows that the width of what precedes must be within 5 to ten characters (not supported in PCRE)
- the engine knows that the width of what precedes must be either 5 or ten character (supported in PCRE)
In contrast, in infinite-width lookbehind, you can use quantified tokens such as "a+".
Engines that Support Infinite-Width Lookbehind
For the record, these engines support infinite lookbehind:
- .NET (C#,VB.NET等)
- Matthew Barnett的
Python
正则表达式模块
- JGSoft(EditPad等;不适用于编程语言)。
据我所知,它们是唯一的。
PCRE中的可变后向引用
在PCRE中,文档中最相关的部分如下:
后向引用断言的内容受到限制,以使其匹配的所有字符串都必须具有固定长度。但是,如果有多个顶级替代方案,则它们并不都必须具有相同的固定长度。
因此,以下后向引用是有效的:
(?<=a |big )cat
然而,这些都不是固定宽度的:
(?<=a\s?|big )cat
(选择的两端没有固定的宽度)
(?<=@{1,10})cat
(宽度可变)
(?<=\R)cat
(\R
没有固定宽度,因为它可以匹配 \n
、\r\n
等)
(?<=\X)cat
(\X
没有固定宽度,因为 Unicode 图形簇可以包含可变数量的字节。)
(?<=a+)cat
(显然不是固定的)
具有零宽度匹配但无限重复的向后查找
现在考虑这个:
(?<=(?=@+))(cat#+)
表面上看,这是一个固定宽度的后顾断言,因为它只能找到零宽度匹配(由前瞻
(?=@++)
定义)。这是绕过无限后顾限制的技巧吗?
不是。PCRE 会卡住。即使后顾的内容是零宽度,PCRE 也不允许在后顾中进行无限重复。任何地方都不行。当文档说它匹配的所有字符串必须具有固定长度时,实际上应该是:
所有组件匹配的字符串都必须具有固定长度。
解决办法:没有无限后顾的生活
在 PCRE 中,解决无限后顾无法帮助的问题的两个主要方法是 \K 和捕获组。
解决办法 #1:\K
\K 断言告诉引擎从它返回的最终匹配中删除到目前为止匹配的内容。
假设你想要 (?<=@+)cat#+,这在 PCRE 中是不合法的。相反,您可以使用:
@+\Kcat#+
解决方法 #2: 捕获组
另一种解决方法是匹配您原本想要放在回顾后面的内容,并在捕获组中捕获感兴趣的内容。然后从捕获组中检索匹配项。
例如,您可以使用以下代码而不是非法的 (?<=@+)cat#+
:
@+(cat#+)
在R中,这可能看起来像这样:
matches <- regexpr("@+(cat#+)", subject, perl=TRUE);
result <- attr(matches, "capture.start")[,1]
attr(result, "match.length") <- attr(matches, "capture.length")[,1]
regmatches(subject, result)
在不支持\K
的语言中,这通常是唯一的解决方案。
引擎内部:PCRE代码说了什么?
最终的答案可以在pcre_compile.c
中找到。如果您检查以以下注释开头的代码块:
如果是lookbehind,则检查此分支是否与固定长度字符串匹配
您会发现繁重的工作由find_fixedlength()
函数完成。
我在此为想要深入了解的任何人复制它。
static int
find_fixedlength(pcre_uchar *code, BOOL utf, BOOL atend, compile_data *cd)
{
int length = -1;
register int branchlength = 0;
register pcre_uchar *cc = code + 1 + LINK_SIZE;
for (;;)
{
int d;
pcre_uchar *ce, *cs;
register pcre_uchar op = *cc;
switch (op)
{
case OP_CBRA:
case OP_BRA:
case OP_ONCE:
case OP_ONCE_NC:
case OP_COND:
d = find_fixedlength(cc + ((op == OP_CBRA)? IMM2_SIZE : 0), utf, atend, cd);
if (d < 0) return d;
branchlength += d;
do cc += GET(cc, 1); while (*cc == OP_ALT);
cc += 1 + LINK_SIZE;
break;
case OP_ALT:
case OP_KET:
case OP_END:
case OP_ACCEPT:
case OP_ASSERT_ACCEPT:
if (length < 0) length = branchlength;
else if (length != branchlength) return -1;
if (*cc != OP_ALT) return length;
cc += 1 + LINK_SIZE;
branchlength = 0;
break;
case OP_RECURSE:
if (!atend) return -3;
cs = ce = (pcre_uchar *)cd->start_code + GET(cc, 1);
do ce += GET(ce, 1); while (*ce == OP_ALT);
if (cc > cs && cc < ce) return -1;
d = find_fixedlength(cs + IMM2_SIZE, utf, atend, cd);
if (d < 0) return d;
branchlength += d;
cc += 1 + LINK_SIZE;
break;
case OP_ASSERT:
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
do cc += GET(cc, 1); while (*cc == OP_ALT);
cc += PRIV(OP_lengths)[*cc];
break;
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
case OP_THEN_ARG:
cc += cc[1] + PRIV(OP_lengths)[*cc];
break;
case OP_CALLOUT:
case OP_CIRC:
case OP_CIRCM:
case OP_CLOSE:
case OP_COMMIT:
case OP_CREF:
case OP_DEF:
case OP_DNCREF:
case OP_DNRREF:
case OP_DOLL:
case OP_DOLLM:
case OP_EOD:
case OP_EODN:
case OP_FAIL:
case OP_NOT_WORD_BOUNDARY:
case OP_PRUNE:
case OP_REVERSE:
case OP_RREF:
case OP_SET_SOM:
case OP_SKIP:
case OP_SOD:
case OP_SOM:
case OP_THEN:
case OP_WORD_BOUNDARY:
cc += PRIV(OP_lengths)[*cc];
break;
case OP_CHAR:
case OP_CHARI:
case OP_NOT:
case OP_NOTI:
branchlength++;
cc += 2;
#ifdef SUPPORT_UTF
if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
break;
case OP_EXACT:
case OP_EXACTI:
case OP_NOTEXACT:
case OP_NOTEXACTI:
branchlength += (int)GET2(cc,1);
cc += 2 + IMM2_SIZE;
#ifdef SUPPORT_UTF
if (utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
#endif
break;
case OP_TYPEEXACT:
branchlength += GET2(cc,1);
if (cc[1 + IMM2_SIZE] == OP_PROP || cc[1 + IMM2_SIZE] == OP_NOTPROP)
cc += 2;
cc += 1 + IMM2_SIZE + 1;
break;
case OP_PROP:
case OP_NOTPROP:
cc += 2;
case OP_HSPACE:
case OP_VSPACE:
case OP_NOT_HSPACE:
case OP_NOT_VSPACE:
case OP_NOT_DIGIT:
case OP_DIGIT:
case OP_NOT_WHITESPACE:
case OP_WHITESPACE:
case OP_NOT_WORDCHAR:
case OP_WORDCHAR:
case OP_ANY:
case OP_ALLANY:
branchlength++;
cc++;
break;
case OP_ANYBYTE:
return -2;
case OP_CLASS:
case OP_NCLASS:
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
case OP_XCLASS:
if (op == OP_XCLASS)
cc += GET(cc, 1);
else
cc += PRIV(OP_lengths)[OP_CLASS];
#else
cc += PRIV(OP_lengths)[OP_CLASS];
#endif
switch (*cc)
{
case OP_CRSTAR:
case OP_CRMINSTAR:
case OP_CRPLUS:
case OP_CRMINPLUS:
case OP_CRQUERY:
case OP_CRMINQUERY:
case OP_CRPOSSTAR:
case OP_CRPOSPLUS:
case OP_CRPOSQUERY:
return -1;
case OP_CRRANGE:
case OP_CRMINRANGE:
case OP_CRPOSRANGE:
if (GET2(cc,1) != GET2(cc,1+IMM2_SIZE)) return -1;
branchlength += (int)GET2(cc,1);
cc += 1 + 2 * IMM2_SIZE;
break;
default:
branchlength++;
}
break;
case OP_ANYNL:
case OP_BRAMINZERO:
case OP_BRAPOS:
case OP_BRAPOSZERO:
case OP_BRAZERO:
case OP_CBRAPOS:
case OP_EXTUNI:
case OP_KETRMAX:
case OP_KETRMIN:
case OP_KETRPOS:
case OP_MINPLUS:
case OP_MINPLUSI:
case OP_MINQUERY:
case OP_MINQUERYI:
case OP_MINSTAR:
case OP_MINSTARI:
case OP_MINUPTO:
case OP_MINUPTOI:
case OP_NOTMINPLUS:
case OP_NOTMINPLUSI:
case OP_NOTMINQUERY:
case OP_NOTMINQUERYI:
case OP_NOTMINSTAR:
case OP_NOTMINSTARI:
case OP_NOTMINUPTO:
case OP_NOTMINUPTOI:
case OP_NOTPLUS:
case OP_NOTPLUSI:
case OP_NOTPOSPLUS:
case OP_NOTPOSPLUSI:
case OP_NOTPOSQUERY:
case OP_NOTPOSQUERYI:
case OP_NOTPOSSTAR:
case OP_NOTPOSSTARI:
case OP_NOTPOSUPTO:
case OP_NOTPOSUPTOI:
case OP_NOTQUERY:
case OP_NOTQUERYI:
case OP_NOTSTAR:
case OP_NOTSTARI:
case OP_NOTUPTO:
case OP_NOTUPTOI:
case OP_PLUS:
case OP_PLUSI:
case OP_POSPLUS:
case OP_POSPLUSI:
case OP_POSQUERY:
case OP_POSQUERYI:
case OP_POSSTAR:
case OP_POSSTARI:
case OP_POSUPTO:
case OP_POSUPTOI:
case OP_QUERY:
case OP_QUERYI:
case OP_REF:
case OP_REFI:
case OP_DNREF:
case OP_DNREFI:
case OP_SBRA:
case OP_SBRAPOS:
case OP_SCBRA:
case OP_SCBRAPOS:
case OP_SCOND:
case OP_SKIPZERO:
case OP_STAR:
case OP_STARI:
case OP_TYPEMINPLUS:
case OP_TYPEMINQUERY:
case OP_TYPEMINSTAR:
case OP_TYPEMINUPTO:
case OP_TYPEPLUS:
case OP_TYPEPOSPLUS:
case OP_TYPEPOSQUERY:
case OP_TYPEPOSSTAR:
case OP_TYPEPOSUPTO:
case OP_TYPEQUERY:
case OP_TYPESTAR:
case OP_TYPEUPTO:
case OP_UPTO:
case OP_UPTOI:
return -1;
default:
return -4;
}
}
}
\K
代码,它是一个特殊形式的后顾断言,可以是可变的。因此,在您的第二个示例中,以下内容将在perl中起作用:/a*\Kb(?=c*)/
。显然,使用可以为零宽度的断言有点毫无意义,因此也许使用+
会更好一些。 - Miller