Sliding-window pattern matching with regular expressions in Perl or MATLAB

Question


I want to parse several numbers out of a line of text using Perl or MATLAB. My line of text is:

t10_t20_t30_t40_

In MATLAB, I used the following script:

str = 't10_t20_t30_t40_';
a = regexp(str,'t(\d+)_t(\d+)','match')

It returns

a = 

't10_t20'    't30_t40'

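I want it to return 't20_t30' as well, since that substring clearly matches the pattern. Why doesn't the regular expression find it?

So I turned to Perl and wrote the following: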

#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ /(t\d+_t\d+)/g)
{
    print "$1\n";
}

The result is the same as in MATLAB:

t10_t20
t30_t40

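But I really want "t20_t30" to appear in the results too.

Can anyone tell me how to achieve this? Thanks!

[Update with solution]: With help from a colleague, I solved the problem using Perl's so-called "lookaround assertions".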

#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ m/(?=(t\d+_t\d+))/g)
{print "$1\n";}

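The key is to use a "zero-width lookahead assertion" in Perl. When Perl (and similar packages) scans a string with a regexp, it does not rescan characters that the previous match already consumed. That is why t20_t30 never appears in the results of the first attempt: the characters "t20" were already used up by the match "t10_t20". To capture it, we scan the string with a zero-width lookahead, which produces matches that consume nothing and therefore exclude no substring from later searches (see the working code above). With the "global" modifier (i.e. m//g), the search starts at position 0 and, because every match is zero-width, the engine moves forward one position at a time and tries again, as many times as the string allows, so every starting position is tested.

This is explained in more detail in this blog post.

The expression (?=t\d+_t\d+) matches the empty string at any position that is immediately followed by t\d+_t\d+, which is what creates the "sliding window". It effectively finds every occurrence of t\d+_t\d+ in $str without excluding anything, because every position in $str is a candidate zero-width match. The lookahead itself returns no text, so the extra parentheses in (?=(t\d+_t\d+)) capture the window at each position, which gives the desired sliding-window results.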
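
To make the difference concrete, here is a minimal sketch that runs both patterns on the same string and records, via pos(), the offset at which each zero-width match occurs. The variable names @consuming and @sliding are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "t10_t20_t30_t40_";

# Consuming match: each match uses up its characters, so the next
# search resumes where the previous match ended.
my @consuming = $str =~ /(t\d+_t\d+)/g;

# Lookahead match: the overall match is zero-width, so nothing is
# consumed; the capture group inside the lookahead records the window.
my @sliding;
while ($str =~ /(?=(t\d+_t\d+))/g) {
    # For a zero-width match, pos() is the offset where it occurred.
    push @sliding, [pos($str), $1];
}

print "consuming: @consuming\n";
print "sliding at offset $_->[0]: $_->[1]\n" for @sliding;
```

The consuming pattern yields t10_t20 and t30_t40, while the lookahead version yields one window at each of the offsets 0, 4 and 8.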

- Xianrui Cheng

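2 Answers

Once the regexp engine finds a match, the matched characters are not considered for any further match (usually this is what people want; for example, .* should not match every possible contiguous substring of this post). The workaround is to restart the search one character after the start of the previous match and collect the results: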

str = 't10_t20_t30_t40_';
sub_str = str;
reg_ex = 't(\d+)_t(\d+)';
start_idx = 0;
all_start_indeces = [];
all_end_indeces = [];
off_set = 0;
%// While there are matches later in the string and the first match of the
%// remaining string is not the last character
while ~isempty(start_idx) && (start_idx < numel(str))
    %// Calculate offset to original string
    off_set = off_set + start_idx;
    %// extract string starting at first character after first match
    sub_str = sub_str((start_idx + 1):end);
    %// find further matches
    [start_idx, end_idx] = regexp(sub_str, reg_ex, 'once');
    %// save match if any
    if ~isempty(start_idx)
        all_start_indeces = [all_start_indeces, start_idx + off_set];
        all_end_indeces = [all_end_indeces, end_idx + off_set];
    end
end
display(all_start_indeces)
display(all_end_indeces)
matched_strings = arrayfun(@(st, en) str(st:en), all_start_indeces, all_end_indeces, 'uniformoutput', 0)
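
For comparison, the same restart idea can be sketched in Perl without slicing the string: after each match, pos() is moved back to one character past the start of that match, so the next m//g search may overlap the previous one. This is a minimal sketch under the assumption that the pattern cannot match again at the same starting offset; @matches is an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "t10_t20_t30_t40_";
my @matches;

while ($str =~ /(t\d+_t\d+)/g) {
    push @matches, $1;
    # $-[0] is the start offset of the whole match; resume the next
    # search one character after it instead of after the match end.
    pos($str) = $-[0] + 1;
}

print "$_\n" for @matches;
```

Compared with the lookahead pattern, this keeps the regex itself unchanged and moves the overlap logic into the loop.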

- zeeMonkeez

This is a good solution, but my reputation on this site is too low to upvote it... sorry. - Xianrui Cheng

- Toto · Accepted Answer

Using Perl:

#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;

my $re = qr/(?=(t\d+_t\d+))/;

my @l = 't10_t20_t30_t40' =~  /$re/g;
say Dumper(\@l);

Output:

$VAR1 = [
          't10_t20',
          't20_t30',
          't30_t40'
        ];
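
Since the original goal was to parse the numbers themselves, the capture groups can also be placed around the digits inside the lookahead, in the same way the MATLAB pattern in the question uses (\d+). The following is a small sketch of that variation; @pairs is an illustrative name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "t10_t20_t30_t40_";
my @pairs;

# Two capture groups inside the zero-width lookahead: each overlapping
# window contributes the two adjacent numbers as one pair.
while ($str =~ /(?=t(\d+)_t(\d+))/g) {
    push @pairs, [$1, $2];
}

print join(" ", map { "($_->[0],$_->[1])" } @pairs), "\n";
```

Each element of @pairs holds two adjacent numbers from the line, one pair per sliding window.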