How can I find the location of a regex match in Perl?

Question

regex perl

35

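I need to write a function that receives a string and a regex. I need to check whether there is a match and return the start and end positions of the match. (The regex has already been compiled with qr//.)

The function might also receive a "global" flag, in which case I need to return the (start, end) pairs of all the matches.

I cannot change the regex, not even add () around it, because the user might use () and \1. Maybe I can use (?:).

Example: given "ababab" and the regex qr/ab/, in the global case I need to get back 3 (start, end) pairs.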

- szabgab

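Judging from Leon's interpretation and mine, you may need to clarify whether the flag corresponds to the /g modifier on the regex or to any () captures. - Michael Carman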

5 Answers

22

Forget my previous post, I've got a better idea.

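# Return (start, end) of the first match, or an empty list if there is none.
sub match_positions {
    my ($regex, $string) = @_;
    return if not $string =~ /$regex/;
    return ($-[0], $+[0]);    # offsets of the whole match
}

# Return a list of [start, end] pairs, one per non-overlapping match.
sub match_all_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /$regex/g) {
        push @ret, [ $-[0], $+[0] ];
    }
    return @ret;
}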

This technique doesn't change the regex in any way.
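As a quick check, here is a minimal sketch that runs the global variant on the example from the question and on a pattern with its own group and backreference. The second pattern and its input string are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Same loop as match_all_positions, repeated so the snippet runs on its own.
sub match_all_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /$regex/g) {
        push @ret, [ $-[0], $+[0] ];
    }
    return @ret;
}

# The example from the question: three non-overlapping matches of "ab".
my @plain = match_all_positions(qr/ab/, "ababab");
print join(" ", map { "($_->[0],$_->[1])" } @plain), "\n";      # (0,2) (2,4) (4,6)

# Illustrative pattern with a group and \1; it is passed through untouched.
my @doubled = match_all_positions(qr/(.)\1/, "aabbcd");
print join(" ", map { "($_->[0],$_->[1])" } @doubled), "\n";    # (0,2) (2,4)
```

The end offset is exclusive here, so substr($string, $start, $end - $start) yields the matched text.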

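Edited to add: to quote from perlvar on $1..$9, "These variables are all read-only and dynamically scoped to the current BLOCK." In other words, if you want to use $1..$9, you cannot use a subroutine to do the matching.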

- Leon Timmermans

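You can use a subroutine to do the matching, but if you want the captures you need to use substr(), @- and @+ to extract the matched text and return it to the caller. - Michael Carman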

True, but that is a particularly tedious thing to do. - Leon Timmermans

Your match_positions function returns undef on some code paths and an array on the others. Is that really okay? - antred

8

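The pos function gives you the position of the match. If you put your regex in parentheses, you can get the length (and thus the end position) using length $1. Like this: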

sub match_positions {
    my ($regex, $string) = @_;
    # /g is needed here: pos() is only set by a scalar-context /g match.
    return if not $string =~ /($regex)/g;
    return (pos($string) - length $1, pos($string));
}

# Note: the added parentheses shift the numbering of any groups in $regex.
sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/g) {
        push @ret, [ pos($string) - length $1, pos($string) ];
    }
    return @ret;
}

- Leon Timmermans

3

This looks completely wrong. In all_match_positions, do not use pos; use pos($string) instead. In the other case, match_positions does not work at all. - Aftershock

1

return if not $string =~ /($regex)/; written this way, you cannot call pos($string) correctly afterwards. - huckfinn

0

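#!/usr/bin/perl
use strict;
use warnings;

# Search for the positions of CpG sites in the human genome.

# Return (start, end) of the first match, with an inclusive end offset.
sub match_positions {
    my ($regex, $string) = @_;
    # /g is required so that pos() is set after the match.
    return if not $string =~ /($regex)/g;
    return (pos($string) - length $1, pos($string) - 1);
}

# Return [start, end] pairs for every match; end is inclusive here.
sub all_match_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /($regex)/g) {
        push @ret, [ pos($string) - length $1, pos($string) - 1 ];
    }
    return @ret;
}

my $regex  = 'CG';
my $string = "ACGACGCGCGCG";
my $cgap   = 3;    # number of consecutive CpG sites per window
my @pos    = all_match_positions($regex, $string);

# Collect the inclusive end offset of each CpG.
my @hgcg;
foreach my $pos (@pos) {
    push @hgcg, $pos->[1];
}

# Print the length of each window spanning $cgap consecutive CpGs.
foreach my $i (0 .. ($#hgcg - $cgap + 1)) {
    my $len = $hgcg[$i + $cgap - 1] - $hgcg[$i] + 2;
    print "$len\n";
}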

- Shicheng Guo

0

If you're willing to have all the regexes in your program execute more slowly, you can also use the deprecated $` variable. From perlvar:

   $`      The string preceding whatever was matched by the last successful pattern match (not
           counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK).
           (Mnemonic: "`" often precedes a quoted string.)  This variable is read-only.

           The use of this variable anywhere in a program imposes a considerable performance penalty
           on all regular expression matches.  See "BUGS".
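perlvar also documents ${^PREMATCH} and ${^MATCH}, which are available from Perl 5.10 when the match uses the /p modifier and do not carry the program-wide penalty described above. The sketch below derives the offsets from their lengths; the helper name match_positions_prematch is hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: derive (start, end) from the text before the match.
sub match_positions_prematch {
    my ($regex, $string) = @_;
    return unless $string =~ /$regex/p;          # /p keeps a copy for ${^PREMATCH}
    my $start = length ${^PREMATCH};             # characters before the match
    return ($start, $start + length ${^MATCH});  # end offset is exclusive
}

my ($start, $end) = match_positions_prematch(qr/ab/, "xxabab");
print "$start $end\n";    # 2 4
```

This only reports the first match; for the global case the @- and @+ arrays are the simpler tool.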

- zigdon

- Michael Carman · Accepted Answer

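The built-in variables @- and @+ hold the start and end positions, respectively, of the last successful match. $-[0] and $+[0] correspond to the entire pattern, while $-[N] and $+[N] correspond to the $N ($1, $2, etc.) submatches.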
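To illustrate how the per-group entries can be combined with substr(), here is a minimal sketch. The helper name match_with_groups, the pattern, and the input string are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: return one [start, end, text] triple per group,
# with index 0 describing the whole match.
sub match_with_groups {
    my ($regex, $string) = @_;
    return unless $string =~ /$regex/;
    my @groups;
    for my $n (0 .. $#+) {    # $#+ is the number of groups in the pattern
        if (defined $-[$n]) {
            push @groups,
                [ $-[$n], $+[$n], substr($string, $-[$n], $+[$n] - $-[$n]) ];
        }
        else {
            push @groups, undef;    # group $n did not take part in the match
        }
    }
    return @groups;
}

my @groups = match_with_groups(qr/(\d+)-(\d+)/, "tel: 555-0199");
for my $n (0 .. $#groups) {
    my ($start, $end, $text) = @{ $groups[$n] };
    print "group $n: [$start, $end) '$text'\n";
}
# group 0: [5, 13) '555-0199'
# group 1: [5, 8) '555'
# group 2: [9, 13) '0199'
```

Because the captured text is copied out before the subroutine returns, the caller does not depend on $1..$9, which are scoped to the block where the match ran.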