不了解正则表达式引擎,就无法确定这些是否正确。例如,POSIX ERE 是最长左匹配而不是最左长匹配,因此它将在一系列替代项中选择最长的,并因此将匹配整个字符串“ab”对应于“/a|ab/”。但是,通常会看到的回溯 NFA 会在这里执行其他操作:它会关心顺序,因此将相同的“ab”字符串与相同的“/a|ab/”模式进行匹配,将它们选择为开头部分“a”。
下一个问题是相同模式中的捕获组。如果它们是有意的,则很奇怪,因为您保留了两位数但没有保留一位数。其他模式不会这样做,但是它们被认为具有相同的行为。因此,我要假设它们在这里出现错误。否则,捕获组的内存使用当然会比不这样做花费更多。
下一个问题是完全缺乏任何锚点。同样,我们无法确定这些是否正确,因为不清楚输入集将是什么样子,以及特定引擎如何处理未锚定的模式。大多数引擎将在字符串的任何位置搜索,但一些不那么友好的编程人员会在那里“有用地”添加开头行(BOL)和结尾行(EOL)锚点。在更习惯的引擎中,如果没有发生这种情况,则在行中间的邮政编码也将匹配,因为五个数字显然包含一个和两个数字的子字符串。我无法猜测您是否需要“^”和“$”锚点或“\b”锚点。
因此,我不得不做一些猜测。我将保留锚点,但是我将重新排列第三个版本的分支,否则您将永远无法使用正常(非 POSIX)类型的回溯 NFA 匹配两位数。
即使在考虑时间之前,查看正则表达式编译器从这些模式中构建了什么样的程序也非常有益。
% perl -Mre=debug -ce '@pats = ( qr/[0-9]{1,2}/, qr/[0-9]?[0-9]/, qr/[0-9][0-9]|[0-9]/ )'
Compiling REx "[0-9]{1,2}"
Final program:
1: CURLY {1,2} (14)
3: ANYOF[0-9][] (0)
14: END (0)
stclass ANYOF[0-9][] minlen 1
Compiling REx "[0-9]?[0-9]"
synthetic stclass "ANYOF[0-9][]".
Final program:
1: CURLY {0,1} (14)
3: ANYOF[0-9][] (0)
14: ANYOF[0-9][] (25)
25: END (0)
stclass ANYOF[0-9][] minlen 1
Compiling REx "[0-9][0-9]|[0-9]"
Final program:
1: BRANCH (24)
2: ANYOF[0-9][] (13)
13: ANYOF[0-9][] (36)
24: BRANCH (FAIL)
25: ANYOF[0-9][] (36)
36: END (0)
minlen 1
-e syntax OK
Freeing REx: "[0-9]{1,2}"
Freeing REx: "[0-9]?[0-9]"
Freeing REx: "[0-9][0-9]|[0-9]"
查看已编译的模式确实是一个好主意。观察已编译的模式执行可能会更有教育意义。在这里,我们将同时观察两者:
% perl -Mre=debug -e '"aabbbababbaaqcccaaaabcacabba" =~ /abc|bca|cab|caab|baac|bab|aaa|bbb/'
Compiling REx "abc|bca|cab|caab|baac|bab|aaa|bbb"
Final program:
1: TRIEC-EXACT[abc] (25)
<abc>
<bca>
<cab>
<caab>
<baac>
<bab>
<aaa>
<bbb>
25: END (0)
stclass AHOCORASICKC-EXACT[abc] minlen 3
Matching REx "abc|bca|cab|caab|baac|bab|aaa|bbb" against "aabbbababbaaqcccaaaabcacabba"
Matching stclass AHOCORASICKC-EXACT[abc] against "aabbbababbaaqcccaaaabcacabba" (28 chars)
0 <> <aabbbababb> | Charid: 1 CP: 61 State: 1, word=0 - legal
1 <a> <abbbababba> | Charid: 1 CP: 61 State: 2, word=0 - legal
2 <aa> <bbbababbaa> | Charid: 2 CP: 62 State: 11, word=0 - fail
2 <aa> <bbbababbaa> | Fail transition to State: 2, word=0 - legal
3 <aab> <bbababbaaq> | Charid: 2 CP: 62 State: 3, word=0 - fail
3 <aab> <bbababbaaq> | Fail transition to State: 5, word=0 - legal
4 <aabb> <bababbaaqc> | Charid: 2 CP: 62 State: 13, word=0 - legal
5 <aabbb> <ababbaaqcc> | Charid: 1 CP: 61 State: 14, word=8 - accepting
Matches word
2 <aa> <bbbababbaa> | 1:TRIEC-EXACT[abc](25)
2 <aa> <bbbababbaa> | State: 1 Accepted: 0 Charid: 2 CP: 62 After State: 5
3 <aab> <bbababbaaq> | State: 5 Accepted: 0 Charid: 2 CP: 62 After State: 13
4 <aabb> <bababbaaqc> | State: 13 Accepted: 0 Charid: 2 CP: 62 After State: 14
5 <aabbb> <ababbaaqcc> | State: 14 Accepted: 1 Charid: 8 CP: 0 After State: 0
got 1 possible matches
only one match left:
5 <aabbb> <ababbaaqcc> | 25:END(0)
Match successful!
Freeing REx: "abc|bca|cab|caab|baac|bab|aaa|bbb"
编译器对此进行了巧妙的处理,将其编译为Aho-Corasick trie结构。显然,这与在同一程序上运行正常回溯NFA的方式有很大不同。
无论如何,以下是您模式的时间安排,或接近它们的时间安排。我为第二个添加了另一种表述,并交换了第三个备选方案的顺序。
testing against short_fail
Rate second first third second_alt
second 9488823/s -- -9
first 10475308/s 10
third 11998438/s 26
second_alt 13434377/s 42
testing against long_fail
Rate second first third second_alt
second 11221411/s -- -3
first 11618967/s 4
third 11776451/s 5
second_alt 11786700/s 5
testing against short_pass
Rate first second_alt second third
first 11720379/s -- -4
second_alt 12199048/s 4
second 12593191/s 7
third 12663378/s 8
testing against long_pass
Rate third second first second_alt
third 11135053/s -- -1
second 11221655/s 1
first 11716924/s 5
second_alt 12042240/s 8
这是由该程序生成的:
use Benchmark qw<cmpthese>;
$short_fail = "a" x 1;
$long_fail = "a" x 600;
$short_pass = $short_fail . 42;
$long_pass = $long_fail . 42;
for my $name (qw< short_fail long_fail short_pass long_pass >) {
print "\ntesting against $name\n";
$_ = $$name;
cmpthese 0 => {
first => '/[0-9]{1,2}/',
second => '/[0-9]?[0-9]/',
second_alt => '/[0-9][0-9]?/',
third => '/[0-9][0-9]|[0-9]/',
}
}
这里是添加了锚点的数字:
testing against short_fail
Rate first second second_alt third
first 11720380/s -- -3% -4% -11%
second 12058622/s 3% -- -1% -9%
second_alt 12180583/s 4% 1% -- -8%
third 13217006/s 13% 10% 9% --
testing against long_fail
Rate third first second_alt second
third 11378120/s -- -2% -4% -12%
first 11566419/s 2% -- -2% -10%
second_alt 11830740/s 4% 2% -- -8%
second 12860517/s 13% 11% 9% --
testing against short_pass
Rate second third second_alt first
second 11540465/s -- -5% -5% -7%
third 12093336/s 5% -- -0% -3%
second_alt 12118504/s 5% 0% -- -2%
first 12410348/s 8% 3% 2% --
testing against long_pass
Rate first second second_alt third
first 11423466/s -- -1% -4% -7%
second 11545540/s 1% -- -3% -7%
second_alt 11870086/s 4% 3% -- -4%
third 12348377/s 8% 7% 4% --
这是产生第二组数字的最小修改程序:
use Benchmark qw<cmpthese>;
$short_fail = 1 . "a";
$long_fail = 1 . "a" x 600;
$short_pass = 2;
$long_pass = 42;
for my $name (qw< short_fail long_fail short_pass long_pass >) {
print "testing against $name\n";
$_ = $$name;
cmpthese 0 => {
first => '/^(?:[0-9]{1,2})$/',
second => '/^(?:[0-9]?[0-9])$/',
second_alt => '/^(?:[0-9][0-9]?)$/',
third => '/^(?:[0-9][0-9]|[0-9])$/',
}
}