Regex to match an optional number

Question

c++ regex

4

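I have a text file that I currently parse with a regex, and this works well. The file format is well defined: two numbers separated by any amount of whitespace, optionally followed by a comment.

Now we need to add an additional (but optional) third number to this file, so that the format becomes 2 or 3 whitespace-separated numbers followed by an optional comment.

I have a regex object that at least matches all the required line formats, but I can't get it to capture the third (optional) number when it is present.

Code: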

#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;

bool regex_check(const std::string& in)
{
   std::regex check{
      "[[:space:]]*?"                    // eat leading spaces
      "([[:digit:]]+)"                   // capture 1st number
      "[[:space:]]*?"                    // eat second set of spaces
      "([[:digit:]]+)"                   // capture 2nd number
      "[[:space:]]*?"                    // eat more spaces
      "([[:digit:]]+|[[:space:]]*?)"     // optionally, capture 3rd number
      "!*?"                              // Anything after '!' is a comment
      ".*?"                              // eat rest of line
   };

   std::smatch match;

   bool result = std::regex_match(in, match, check);

   for(auto m : match)
   {
      std::cout << "  [" << m << "]\n";
   }

   return result;
}

int main()
{
   std::vector<std::string> to_check{
      "  12  3",
      "  1  2 ",
      "  12  3 !comment",
      "  1  2     !comment ",
      "\t1\t1",
      "\t  1\t  1\t !comment   \t",
      " 16653    2      1",
      " 16654    2      1 ",
      " 16654    2      1   !    comment",
      "\t16654\t\t2\t   1\t ! comment\t\t",
   };

   for(auto s : to_check)
   {
      assert(regex_check(s));
   }

   return 0;
}

This produces the following output:

  [  12  3]
  [12]
  [3]
  []
  [  1  2 ]
  [1]
  [2]
  []
  [  12  3 !comment]
  [12]
  [3]
  []
  [  1  2     !comment ]
  [1]
  [2]
  []
  [ 1   1]
  [1]
  [1]
  []
  [   1   1  !comment       ]
  [1]
  [1]
  []
  [ 16653    2      1]
  [16653]
  [2]
  []
  [ 16654    2      1 ]
  [16654]
  [2]
  []
  [ 16654    2      1   !    comment]
  [16654]
  [2]
  []
  [ 16654       2      1     ! comment      ]
  [16654]
  [2]
  []

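As you can see, the code matches every expected input format, but it never captures the third number, even when it is present.

I'm currently testing with GCC 5.1.1, but the actual target compiler will be GCC 4.8.2, using boost::regex in place of std::regex.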

- Chad

1 Answer

- Lucas Trzesniewski · Accepted Answer

Let's step through the following example.

 16653    2      1
^

^ is the current match offset. At this point, we're here in the pattern:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^

To keep the notation short, I've abbreviated [[:space:]] to \s and [[:digit:]] to \d.

\s*? matches, and then (\d+) matches. We end up in the following state:

 16653    2      1
      ^

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^

Same thing: \s*? matches, then (\d+) matches. The state is:

 16653    2      1
           ^

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^

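Now things get trickier.

You have a \s*? here, which is a lazy quantifier. The engine first tries to match nothing with it and checks whether the rest of the pattern matches. So it moves on to the alternation.

The first alternative is \d+, but it fails, because there is no digit at this position.

The second alternative is \s*?, and there are no other alternatives after it. It's lazy, so let's try matching the empty string first.

The next token is !*?, which also matches the empty string, and it is followed by .*?, which will match everything up to the end of the string (because you're using regex_match; with regex_search it would match the empty string).

At this point you've successfully reached the end of the pattern, and you have a match without ever having been forced to match \d+ against the string.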
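
The regex_match versus regex_search remark can be checked in isolation. The following is only a minimal sketch using std::regex with its default ECMAScript grammar; the helper names matched_by_regex_match and matched_by_regex_search are made up for this illustration and are not part of the question's code.

```cpp
#include <regex>
#include <string>

// Text consumed by the lazy pattern ".*?" when the WHOLE input must match.
// Returns an empty string if regex_match fails.
std::string matched_by_regex_match(const std::string& text)
{
    static const std::regex lazy_any{".*?"};
    std::smatch m;
    return std::regex_match(text, m, lazy_any) ? m.str(0) : std::string{};
}

// Text consumed by the same pattern when any substring may match.
// The lazy quantifier is satisfied immediately by the empty string.
std::string matched_by_regex_search(const std::string& text)
{
    static const std::regex lazy_any{".*?"};
    std::smatch m;
    return std::regex_search(text, m, lazy_any) ? m.str(0) : std::string{};
}
```

Called on "      1", the first helper should return the whole string, because regex_match forces .*? to keep growing until the end of the input, while the second should return an empty string.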

The problem is that this whole part of the pattern ends up being optional:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/
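
To see this in code, here is a small sketch that runs the question's original pattern (in the abbreviated \s / \d form used above) and returns whatever group 3 captured. The function name third_capture_original is hypothetical, and the sketch assumes std::regex with the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Returns the text captured by group 3 of the ORIGINAL pattern,
// or "<no match>" if the line as a whole does not match.
std::string third_capture_original(const std::string& line)
{
    static const std::regex original{
        R"(\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?)"};
    std::smatch m;
    if (!std::regex_match(line, m, original))
        return "<no match>";
    return m[3].str();  // empty: the lazy \s*? alternative wins
}
```

For " 16653    2      1" this returns an empty string, which agrees with the [] lines in the question's output: the trailing "1" is swallowed by .*? instead of being captured.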

So you could rewrite your pattern like this:

\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?

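Demo (anchors added to mimic the behavior of regex_match)

This way, you force the regex engine to consider \d instead of lazily settling for the empty string. Since \s and \d are disjoint, the lazy quantifiers are not needed.

!*?.*? is also suboptimal, since !*? is already covered by the .*? that follows it. I'd rewrite it as (?:!.*)? to require a ! at the start of the comment; if trailing text does not start with !, the match fails.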
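
As a rough sketch of how the rewritten pattern could be used from C++, the function below returns the two or three numbers it finds. It assumes std::regex with the default ECMAScript grammar (the question's target, boost::regex, is not exercised here), and parse_numbers is a name invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Parses "N N [N] [!comment]" and returns the numbers as strings.
// An empty vector means the line did not match the format.
std::vector<std::string> parse_numbers(const std::string& line)
{
    static const std::regex fixed{
        R"(\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?)"};
    std::smatch m;
    std::vector<std::string> numbers;
    if (!std::regex_match(line, m, fixed))
        return numbers;
    numbers.push_back(m[1].str());
    numbers.push_back(m[2].str());
    if (m[3].matched)  // the third number is optional
        numbers.push_back(m[3].str());
    return numbers;
}
```

Checking m[3].matched distinguishes "no third number" from a captured one. Note that, unlike the original pattern, this one rejects a line whose trailing text is not introduced by !, such as "1 2 x", and because whitespace between the numbers is now mandatory, a lone "12" is no longer split into two numbers.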