Is there a better way to write Perl regexes with /x so the code is still easy to read?

Question

9

I ran Perl::Critic on one of my scripts and got this message:

Regular expression without "/x" flag at line 21, column 26. See page 236 of PBP.

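I looked up the policy information here, and I understand that writing regular expressions in extended mode will help anyone who is reading the code.

However, I am stuck on how to convert my code to use the /x flag. The CPAN example: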

# Match a single-quoted string efficiently...

m{'[^\\']*(?:\\.[^\\']*)*'};  #Huh?

# Same thing with extended format...

m{
    '           # an opening single quote
    [^\\']      # any non-special chars (i.e. not backslash or single quote)
    (?:         # then all of...
        \\ .    #    any explicitly backslashed char
        [^\\']* #    followed by an non-special chars
    )*          # ...repeated zero or more times
    '           # a closing single quote
}x;

This makes sense when you are looking at the regex on its own. My code:

if ($line =~ /^\s*package\s+(\S+);/ ) {

I am not exactly sure how to use an extended regex inside an if statement. I can write it like this:

    if (
        $line =~ /
        ^\s*    # starting with zero or more spaces
        package
        \s+     # at least one space
        (\S+)   # capture any non-space characters
        ;       # ending in a semi-colon
        /x
      )
    {

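This works, but I think it is almost harder to read than the original. Is there a better way (or a best-practice way) to write this? I guess I could create a variable using qr//.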
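The qr// idea mentioned above can be sketched roughly as follows. This is only a minimal illustration, not a recommended house style: the variable name $PACKAGE_DECL and the sample lines are made up for the example, and the pattern is the one from the question, unchanged apart from the added whitespace.

```perl
use strict;
use warnings;

# Hypothetical name, introduced for this sketch only.
# The pattern is compiled once and keeps its own /x flag.
my $PACKAGE_DECL = qr{ ^\s* package \s+ (\S+) ; }x;

for my $line ( 'package Foo::Bar;', '  package Baz;', 'use strict;' ) {
    # The if statement stays on one line; the details live in the variable.
    if ( $line =~ $PACKAGE_DECL ) {
        print "found package $1\n";
    }
}
```

Keeping the pattern in a named variable moves the vertical space out of the condition, at the cost of separating the regex from the place where it is used.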

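I am not really looking for advice on rewriting this specific regex (although if I can improve it, I will take suggestions); I am looking more for advice on how to expand a regex inside an if statement.

I know Perl::Critic is just a guideline, but it would be nice to follow it.

Thanks in advance!

Edit: After reading some of the answers, it became clear to me that breaking a regex into multiple lines with comments is not always necessary. People who understand basic regexes should be able to understand what my example is doing; the comments I added were probably unnecessary and verbose. I like the idea of using the extended regex flag while still embedding spaces in the regex to make each part of it a little clearer. Thanks for all the input!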

- BrianH

5 Answers

11

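I really do not think you should waste vertical screen space on this. On the other hand, if I were to write this pattern over several lines, I would use braces and indent the pattern: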

if ($line =~ m{
        \A \s*
        package
        \s+
        (\S+)
        \s* ;
    }x 
) {

In my opinion, the following version is perfectly fine and still gets the benefits of m//x:

if ( $line =~ m{ \A \s* package \s+ (\S+) \s* ; }x  ) {

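Comments are completely unnecessary in this case because you are not doing anything tricky. I did add \s* before the semicolon, because sometimes people set the semicolon apart from the package name and that should not throw off your match.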
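To see what the extra \s* buys, here is a small sketch comparing the pattern from the question with the single-line /x version. The variable names and the two sample lines are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# Pattern from the question and the relaxed single-line /x version.
my $original = qr/^\s*package\s+(\S+);/;
my $relaxed  = qr{ \A \s* package \s+ (\S+) \s* ; }x;

for my $line ( 'package Foo::Bar;', 'package Foo::Bar ;' ) {
    for my $case ( [ original => $original ], [ relaxed => $relaxed ] ) {
        my ( $name, $re ) = @$case;
        my ($package) = $line =~ $re;    # list context returns the capture
        printf "%-8s %-22s => %s\n", $name, "'$line'",
            defined $package ? $package : '(no match)';
    }
}
```

Only the relaxed pattern accepts the line with a space before the semicolon; both capture Foo::Bar from the first line.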

- Sinan Ünür

I had to go to http://www.perl.com/doc/manual/html/pod/perlre.html to look up what "\A" means. Is that preferred over "^"? - BrianH

2

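I guess I had not thought about adding whitespace to a single-line regex before. I always thought of the "/x" flag as a multi-line flag, but I really like your example above. - BrianH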

2

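@BrianH: Not really. There is only a difference when /m is in use, and when you use /m you usually want ^ rather than \A. On the other hand, $ is often used where people really want \z. - ysth

@ysth, I do like the symmetry between \A and \z. However, my introducing \A above was indeed unnecessary. - Sinan Ünür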
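The anchor behaviour discussed in these comments can be checked with a short script. This is just an illustrative sketch; the sample string and the report helper are invented for the demo.

```perl
use strict;
use warnings;

# Sample input: the package line is the second line of the string.
my $text = "first line\npackage Foo;\n";

# Hypothetical helper for this demo: prints a label and the match result.
sub report {
    my ( $label, $matched ) = @_;
    printf "%-14s %s\n", $label, $matched ? 'match' : 'no match';
    return;
}

# Without /m, ^ and \A both anchor only at the very start of the string.
report( '^  without /m', ( $text =~ /^package/   ? 1 : 0 ) );
report( '\A without /m', ( $text =~ /\Apackage/  ? 1 : 0 ) );

# With /m, ^ also matches after an embedded newline; \A still does not.
report( '^  with /m',    ( $text =~ /^package/m  ? 1 : 0 ) );
report( '\A with /m',    ( $text =~ /\Apackage/m ? 1 : 0 ) );

# $ tolerates one trailing newline; \z insists on the true end of the string.
report( '$  at end',     ( $text =~ /;$/         ? 1 : 0 ) );
report( '\z at end',     ( $text =~ /;\z/        ? 1 : 0 ) );
```

Of the six checks, only "^ with /m" and "$ at end" report a match.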

8

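How much value this extra information adds is pretty much your call.

Sometimes you are right: explaining what is going on adds nothing and just makes the code look messy. For complex regexes, though, the x flag can be a real help.

Actually, making that call about whether to add the extra information can be quite difficult.

I have lost count of how much legacy code I have seen where nicely formatted comments had not been maintained and had drifted away from what the code does. In fact, when I was less experienced, I went off in completely the wrong direction because the comment attached to a piece of code was stale and had not been kept up to date.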

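Edit: In some ways the CPAN example is not that useful. When I use the x flag to comment a complex regex, I tend to describe the components the regex is trying to match rather than just describing the pieces of the regex itself. For example, I would write something like:

the first component of a UK postcode (area and district), or
the international area code for the UK, or
any UK mobile phone number.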

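That tells me more than:

one or two letters, followed by a digit, optionally followed by a letter, or
two four-digit numbers grouped together, or
a zero, followed by four decimal digits, a dash, and then six decimal digits.

In this case, my feeling is that the regex does not need comments. Your gut feeling was right!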
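As a rough sketch of the "describe what is being matched" style, the first description above could be written like this. The pattern follows the simplified "one or two letters, a digit, an optional letter" wording and is not a complete UK postcode validator; the variable name, the sample postcodes, and the wording of the comments are assumptions made for the example.

```perl
use strict;
use warnings;

# Simplified on purpose: mirrors the description in the text only.
# The comments name the thing being matched, not the regex syntax.
my $outward_code = qr{
    [A-Z]{1,2}   # postcode area
    [0-9]        # postcode district
    [A-Z]?       # optional district suffix letter
}x;

for my $postcode ( 'SW1A 2AA', 'M1 1AE' ) {
    if ( $postcode =~ m{ \A ($outward_code) \s }x ) {
        print "${postcode}: first component is $1\n";
    }
}
```

A comment such as "postcode area" survives a rewrite of the character class, whereas "one or two letters" goes stale as soon as the pattern changes.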

- Rob Wells

1

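Very good edit about describing the regex. I fell into the trap of describing what the regex is doing (like "capture any non-space characters"), when a clearer description such as "capture the package name" would probably be better. I would upvote your post again if I could! - BrianH

Thanks @BrianH. Seeing a comment like "# add 1 to i" above a line of C code that reads "i++;" is a real headache ;-) - Rob Wells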

6

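Since this topic is about alternative ways of writing regexes, there are ways to write complicated regexes without variables and without comments that are still useful.

I reformatted Chas Owens's date-validation regex into the new declarative form available in Perl 5.10, which has a number of benefits:

Tokens in the regex are reusable.
Anyone who prints the regex later will still see the whole logic tree.

It may not be for everyone, but for extremely complicated things like date validation it can be handy (hint: in the real world, use a date module and do not roll your own; this is only an example to learn from).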

#!/usr/bin/perl 
use strict;
use warnings;
require 5.010;

#named subpatterns for validating ymd, mdy and dmy dates
my $date_syntax = qr{
    (?(DEFINE)
        (?<century>
            ( 1[6-9] | [2-9][0-9] )
        )
        (?<decade>
            [0-9]{2} (?!\d)
        )
        (?<year>
            (?&century)? (?&decade)(?!\d)
        )
        (?<leapdecade> (
            0[48]       | 
            [2468][048] |
            [13579][26]
            )(?!\d)
        )
        (?<leapcentury> (
            16          | 
            [2468][048] |
            [3579][26]
            )
        )   
        (?<leapyear>
            (?&century)?(?&leapdecade)(?!\d)
            |
            (?&leapcentury)00(?!\d)
        )
        (?<monthnumber>      ( 0?[1-9] | 1[0-2] )(?!\d)                  )
        (?<shortmonthnumber> ( 0?[469] | 11     )(?!\d)                  )
        (?<longmonthnumber>  ( 0?[13578] | 1[02] )(?!\d)                 )
        (?<nonfebmonth>      ( 0?[13-9] | 1[0-2] )(?!\d)                 )
        (?<febmonth>         ( 0?2 )(?!\d)                               )
        (?<twentyeightdays>  ( 0?[1-9] | 1[0-9] | 2[0-8] )(?!\d)         )
        (?<twentyninedays>   ( (?&twentyeightdays) | 29 )(?!\d)          )
        (?<thirtydays>       ( (?&twentyeightdays) | 29 | 30 )(?!\d)     )
        (?<thirtyonedays>    ( (?&twentyeightdays) | 29 | 30 | 31 )(?!\d))
        (?<separator>        [/.-]                              )               #/ markdown syntax highlighter fix
        (?<ymd>
            (?&leapyear) (?&separator) (?&febmonth) (?&separator) (?&twentyninedays) (?!\d)
            |
            (?&year) (?&separator) (?&longmonthnumber) (?&separator) (?&thirtyonedays) (?!\d)
            |
            (?&year) (?&separator) (?&shortmonthnumber) (?&separator) (?&thirtydays) (?!\d)
            |
            (?&year) (?&separator) (?&febmonth) (?&separator) (?&twentyeightdays) (?!\d)
        )
        (?<mdy>
            (?&febmonth) (?&separator) (?&twentyninedays) (?&separator) (?&leapyear)  (?!\d)
            |
            (?&longmonthnumber) (?&separator) (?&thirtyonedays) (?&separator) (?&year) (?!\d)
            |
            (?&shortmonthnumber) (?&separator) (?&thirtydays) (?&separator) (?&year) (?!\d)
            |
            (?&febmonth) (?&separator) (?&twentyeightdays) (?&separator) (?&year) (?!\d)
        )
        (?<dmy>
            (?&twentyninedays) (?&separator) (?&febmonth) (?&separator) (?&leapyear)  (?!\d)
            |
            (?&thirtyonedays) (?&separator) (?&longmonthnumber) (?&separator)(?&year) (?!\d)
            |
            (?&thirtydays) (?&separator) (?&shortmonthnumber) (?&separator) (?&year) (?!\d)
            |
            (?&twentyeightdays) (?&separator) (?&febmonth) (?&separator)  (?&year) (?!\d)
        )
        (?<date>
            (?&ymd) | (?&mdy) | (?&dmy)
        )
        (?<exact_date>
           ^(?&date)$
       )
    )
}x;

my @test = ( "2009-02-29", "2009-02-28", "2004-02-28", "2004-02-29", "2005-03-31", "2005-04-31", "2005-05-31", 
    "28-02-2009","02-28-2009",        
);

for (@test) {
  if ( $_ =~ m/(?&exact_date) $date_syntax/x ) {
    print "$_ is valid\n";
  }
  else {
    print "$_ is not valid\n";
  }

  if ( $_ =~ m/^(?&ymd) $date_syntax/x ) {
    print "$_ is valid ymd\n";
  }
  else {
    print "$_ is not valid ymd\n";
  }


  if ( $_ =~ m/^(?&leapyear) $date_syntax/x ) {
    print "$_ is leap (start)\n";
  }
  else {
    print "$_ is not leap (start)\n";
  }

  print "\n";
}

Note the addition of the (?!\d) fragments, so that

"45" will not match =~ m{(?&twentyeightdays) $date_syntax} by way of the '4' matching 0?[1-9]

- Kent Fredric
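The effect of the (?!\d) guard can be seen in isolation with a cut-down version of the day pattern. This is a sketch only: the two variable names are invented, and the \A anchor is an assumption added so that the comparison is unambiguous.

```perl
use strict;
use warnings;

# Day-of-month 1..28, without and with the negative lookahead.
my $day_loose   = qr{ \A (?: 0?[1-9] | 1[0-9] | 2[0-8] ) }x;
my $day_guarded = qr{ \A (?: 0?[1-9] | 1[0-9] | 2[0-8] ) (?!\d) }x;

for my $input (qw( 7 28 45 )) {
    printf "%-3s loose: %-8s guarded: %s\n", $input,
        ( $input =~ $day_loose   ? 'match' : 'no match' ),
        ( $input =~ $day_guarded ? 'match' : 'no match' );
}
```

Both patterns accept 7 and 28, but only the loose one accepts 45, because its first alternative is satisfied by the leading 4 alone.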

This makes me really look forward to Perl6. - Brad Gilbert

1

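This looks more like a question of how to consistently indent a multi-line if condition... and there are plenty of answers to that. What really matters is consistency. If you use perltidy or some other formatter, be consistent with what it produces (with your configuration). I would indent the contents of the regex one level from the delimiters, though.

Your post shows one major flaw in running existing code through something like Perl::Critic: your CPAN example leaves out a * from the original regex. If you do a lot of "cleaning up", you can expect to introduce bugs, so I hope you have a good test suite.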
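A quick regression check makes this kind of slip visible. The sketch below compares the one-line CPAN pattern with the extended version exactly as quoted in the question, missing * included; the variable names and the three sample inputs are assumptions for the demo.

```perl
use strict;
use warnings;

# The one-line original quoted in the question.
my $original = qr{'[^\\']*(?:\\.[^\\']*)*'};

# The extended version as quoted, with no "*" after the first class.
my $extended = qr{
    '
    [^\\']
    (?: \\ . [^\\']* )*
    '
}x;

for my $input ( q{''}, q{'a'}, q{'abc'} ) {
    # Capture what each pattern matched, if anything.
    my ($got_original) = $input =~ /($original)/;
    my ($got_extended) = $input =~ /($extended)/;
    my $same = ( defined $got_original ? $got_original : '' )
            eq ( defined $got_extended ? $got_extended : '' );
    printf "%-6s %s\n", $input, $same ? 'same' : 'DIFFERENT';
}
```

The two patterns agree on 'a' but differ on '' and 'abc', which the original matches and the extended version does not.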

- ysth

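Which "*" did I leave out? And yes, I do have a small test suite for this script. The script just searches my system for installed Perl modules, so it is not a big deal if it breaks, but I understand your point about the risk of cleaning up existing code. - BrianH

Oh, you were talking about the CPAN example missing the "*". I took that straight from http://search.cpan.org/~elliotjs/Perl-Critic-1.098/lib/Perl/Critic/Policy/RegularExpressions/RequireExtendedFormatting.pm - it is not my code. But it does illustrate your point. - BrianH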

- Chas. Owens · Accepted Answer

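Never write comments that say what the code already says; comments should tell you why the code says what it says. Look at this monstrosity: without the comments it is very difficult to see what is going on, but the comments make it clear what is being matched.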

require 5.010;
my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;