I have a very large file, aab.txt, whose contents are aaa...aab.
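To my great surprise,
perl -ne '/a*bb/' < aab.txt
(where the match fails) runs faster than
perl -ne '/a*b/' < aab.txt
(where the match succeeds). Why? Both regexes should first gobble up all the a's; then the second one should succeed immediately, while the first would have to backtrack over and over before finally failing.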
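There is an optimization that first looks for a constant part of the pattern, in this case a floating b or bb. This part can be checked quite efficiently, without having to keep track of backtracking state. If no bb is found, the match is aborted right there. Not so with b: that floating substring is found, and the match is constructed from there. Here is the debug output of the regex match (the program is "aaab" =~ /a*b/):
Compiling REx "a*b"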
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <b> (6)
6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
0 <> <aaab> | 1:STAR(4)
EXACT <a> can match 3 times out of 2147483647...
3 <aaa> <b> | 4: EXACT <b>(6)
4 <aaab> <> | 6: END(0)
Match successful!
Freeing REx: "a*b"
You can get this kind of output by turning on the debug option of the re module, for example with -Mre=debug on the command line.
Strictly speaking, looking for the b or bb is unnecessary, but it allows the match to fail earlier.
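An unanchored /a*bb/ is effectively /^(?s:.*?)a*bb/, which contains two * quantifiers. Without optimizations, it is quadratic. In the worst case (a string consisting entirely of a's), for a string of length N, it checks whether the current character is an a N*(N-1)/2 times. We call that O(N^2).
Run the following:
perl -Mre=debug -e"'aaaaab' =~ /a*bb/"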
You get information about the compilation of the pattern:
Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
1: STAR (4)
2: EXACT <a> (0)
4: EXACT <bb> (6)
6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2
The last line indicates it will search for bb in the input before starting to match.
You get information about the evaluation of the pattern:
Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer
Here you see that check in action.
... <. That is what the -n flag does for you implicitly. - squiguy