Perl是否缓存正则表达式生成？

Question

Perl是否缓存正则表达式生成？

6

假设我有一个动态生成正则表达式并进行匹配的函数。

例如，在以下函数中，match_here会在正则表达式开头插入\G锚定符号。这简化了API，因为调用者不需要记住在模式中包含pos锚定符号。

#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Data::Dumper;

sub match_here {
  my ($str, $index, $rg) = @_;
  pos($str) = $index;
  croak "index ($index) out of bounds" unless pos($str) == $index;
  my $out;
  if ($str =~ /\G$rg/) {
    $out = $+[0]; 
  }
  return $out;
}

# no match starting at position 0
# prints '$VAR1 = undef;'    
print Dumper(match_here("abc", 0, "b+"));
# match from 1 to 2
# prints '$VAR1 = 2;'
print Dumper(match_here("abc", 1, "b+"));

我想知道每次函数被评估时，匿名正则表达式对象是否会"编译"，或者是否有缓存，使得相同的字符串不会导致额外的正则表达式对象被编译。

另外，假设Perl解释器没有进行缓存，编译一个正则表达式对象是否足够昂贵，值得缓存（可能在XS扩展中）？

- Greg Nisbet

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- user2404501 · Accepted Answer

来自m//运算符下的perlop(1)：

PATTERN可以包含变量，每次模式匹配计算都会被插值。

[...]

Perl只有在包含的插值变量发生改变时才会重新编译模式。您可以添加一个尾部分隔符后面的"/o"（表示"once"）来强制Perl跳过测试并永远不重新编译。曾经，Perl会无谓地重新编译正则表达式，因此这个修饰符非常有用，可以告诉它不要这样做，从而提高速度。

所以，确实存在缓存，您甚至可以通过使用/o来强制使用无效的缓存，但您真的不应该这样做。

但是，该缓存仅为每个m//或s///运算符的实例存储了一个已编译的正则表达式，因此仅在与相同变量（例如，您的$rg）连续多次使用该正则表达式时才有效。如果您在使用$rg='b+'和$rg='c+'之间交替调用它，则每次都会重新编译。

对于这种情况，您可以使用qr//运算符来进行自己的缓存。它显式编译正则表达式并返回一个对象，您可以将其存储并在以后执行正则表达式。可以将其纳入您的match_here中：

use feature 'state';

sub match_here {
  my ($str, $index, $rg) = @_;
  pos($str) = $index;
  croak "index ($index) out of bounds" unless pos($str) == $index;
  my $out;
  state %rg_cache;
  my $crg = $rg_cache{$rg} ||= qr/\G$rg/;
  if ($str =~ /$crg/) {
    $out = $+[0]; 
  }
  return $out;
}

关于基本缓存（未使用qr//）的更多细节：每次分配新的词法变量$rg并不重要，重要的是值与上一个相同。

以下是一个例子来证明这一点：

use re qw(Debug COMPILE);

while(<>) {
  chomp;
  # Insane interpolation. Do not use anything remotely like this in real code
  print "MATCHED: $_\n" if /^${\(`cat refile`)}/;
}

每次匹配运算符执行时，它都会读取refile。正则表达式是以^开头，后跟refile的内容。调试输出显示，仅当文件内容发生更改时才重新编译。如果文件仍具有与上次相同的内容，则运算符会注意到相同的字符串再次传递给正则表达式编译器，并重用缓存的结果。

或者尝试这个不那么激烈的例子：

use re qw(Debug COMPILE);

@patterns = (
  '\d{3}',
  '\d{3}',
  '[aeiou]',
  '[aeiou]',
  '\d{3}',
  '\d{3}'
);

for ('xyz', '123', 'other') {
  for $i (0..$#patterns) {
    if(/$patterns[$i]/) {
      print "$_ matches $patterns[$i]\n";
    } else {
      print "$_ does not match $patterns[$i]\n";
    }
  }
}

其中有18个编译，其中11个是缓存命中，即使相同的“变量”（@patterns数组的相同元素）从未连续使用。