为什么Perl回溯匹配失败的时间似乎比匹配成功要少？

Question

为什么Perl回溯匹配失败的时间似乎比匹配成功要少？

8

我有一个很大的文件aab.txt，其内容为aaa...aab。

令我惊讶的是

perl -ne '/a*bb/' < aab.txt

运行（匹配失败）比更快

perl -ne '/a*b/' < aab.txt

（匹配成功）。为什么？？？？两个正则表达式都应该先吞咽所有的a，然后第二个表达式立即成功，而第一个正则表达式将不得不一遍又一遍地回溯，以失败。

- Mark Galeck

1

不需要 <。这就是 -n 标志隐式为您执行的操作。 - squiguy

谢谢回答者和评论者。我只能接受一个答案，但我应该接受两个。 - Mark Galeck

2个回答

6

/a*bb/

基本上就是

/^(?s:.*?)a*bb/

请注意这两个*。除了优化，它是二次的。在最坏的情况下（一个由所有a组成的字符串），对于长度为N的字符串，它将检查当前字符是否为a N*(N-1)/2次。我们称之为O(N²)。

在开始匹配之前，值得扫描一下字符串（O(N))，看看它是否可能匹配。匹配需要更长的时间，但失败的匹配会更快。这就是Perl所做的。

当你运行以下内容时：

perl -Mre=debug -e"'aaaaab' =~ /a*bb/"

You get information about the compilation of the pattern:

Compiling REx "a*bb"
synthetic stclass "ANYOF{i}[ab][{non-utf8-latin1-all}]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <bb> (6)
   6: END (0)
floating "bb" at 0..2147483647 (checking floating) stclass ANYOF{i}[ab][{non-utf8-latin1-all}] minlen 2

The last line indicates it will search for bb in the input before starting to match.

You get information about the evaluation of the pattern:

Guessing start of match in sv for REx "a*bb" against "aaaaab"
Did not find floating substr "bb"...
Match rejected by optimizer

Here you see that check in action.

- ikegami

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- amon · Accepted Answer

Perl的正则表达式被优化为尽早失败，而不是尽快成功。在搜索大型日志文件时，这是非常明智的。

有一个优化方法，首先查找字符串的常量部分，例如“浮动”的b或bb。可以相当有效地检查此部分，而无需跟踪回溯状态。没有发现bb，匹配会在那里中止。

但b不是这样。找到该浮动子字符串，并从那里构建匹配。以下是正则表达式匹配的调试输出（程序是"aaab" =~ /a*b/）：

Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <b> (6)
   6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
   0 <> <aaab>               |  1:STAR(4)
                                  EXACT <a> can match 3 times out of 2147483647...
   3 <aaa> <b>               |  4:  EXACT <b>(6)
   4 <aaab> <>               |  6:  END(0)
Match successful!
Freeing REx: "a*b"

您可以通过在re模块的debug选项中设置来获得这样的输出。

严格来说，查找b或bb是不必要的，但它可以使匹配失败更早。