Java Matcher groups: Understanding the difference between "(?:X|Y)" and "(?:X)|(?:Y)"

7

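Can anyone explain:

  1. Why do the two patterns used below give different results? (answered below)
  2. Why does the second example report a group count of 1, yet say that the start and end of group 1 are both -1?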
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 public void testGroups() throws Exception
 {
  String TEST_STRING = "After Yes is group 1 End";
  {
   // Pattern 1: the alternation is confined to the non-capturing group
   Pattern p;
   Matcher m;
   String pattern="(?:Yes|No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();   // number of capturing groups in the pattern
   int start=m.start(1);       // -1 if group 1 did not take part in the match
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

  {
   // Pattern 2: the alternation spans the whole pattern
   Pattern p;
   Matcher m;

   String pattern="(?:Yes)|(?:No)(.*)End";
   p=Pattern.compile(pattern);
   m=p.matcher(TEST_STRING);
   boolean f=m.find();
   int count=m.groupCount();
   int start=m.start(1);
   int end=m.end(1);

   System.out.println("Pattern=" + pattern + "\t Found=" + f + " Group count=" + count + 
     " Start of group 1=" + start + " End of group 1=" + end );
  }

 }

This gives the following output:

Pattern=(?:Yes|No)(.*)End  Found=true Group count=1 Start of group 1=9 End of group 1=21
Pattern=(?:Yes)|(?:No)(.*)End  Found=true Group count=1 Start of group 1=-1 End of group 1=-1
4 Answers

9
  1. The difference is that in the second pattern "(?:Yes)|(?:No)(.*)End", the concatenation ("X followed by Y" in "XY") has higher precedence than the choice ("Either X or Y" in "X|Y"), like multiplication has higher precedence than addition, so the pattern is equivalent to

    "(?:Yes)|(?:(?:No)(.*)End)"
    

    What you wanted to get is the following pattern:

    "(?:(?:Yes)|(?:No))(.*)End"
    

    This yields the same output as your first pattern.

    In your test, the second pattern has the group 1 at the (empty) range [-1, -1[ because that group did not match (the start -1 is included, the end -1 is excluded, making the half-open interval empty).

  2. A capturing group is a group that may capture input. If it captures, one also says it matches some substring of the input. If the regex contains choices, then not every capturing group may actually capture input, so there may be groups that do not match even if the regex matches.

  3. The group count, as returned by Matcher.groupCount(), is determined purely by counting the grouping brackets of capturing groups, irrespective of whether any of them could match on any given input. Your pattern has exactly one capturing group: (.*). This is group 1. The documentation states:

    (?:X)    X, as a non-capturing group
    

    and explains:

    Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing group.

    Whether any specific group matches on a given input, is irrelevant for that definition. E.g., in the pattern (Yes)|(No), there are two groups ((Yes) is group 1, (No) is group 2), but only one of them can match for any given input.

  4. The call to Matcher.find() returns true if the regex matched some substring. You can determine which groups matched by looking at their start: if it is -1, the group did not match, and its end is -1 as well. There is no built-in method that tells you how many capturing groups actually matched after a call to find() or matches(). You would have to count them yourself by looking at each group's start.

  5. When it comes to backreferences, also note what the regex tutorial has to say:

    There is a difference between a backreference to a capturing group that matched nothing, and one to a capturing group that did not participate in the match at all.
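The distinction in point 5 can be observed directly in Java. The following is a minimal sketch and not part of the original answer; the class name BackrefDemo, the helper fullMatch, and the two toy patterns are assumptions chosen only for illustration. In (q?)b\1 the group always participates (possibly capturing the empty string), whereas in (q)?b\1 the group can be skipped entirely, and in java.util.regex a backreference to a group that did not participate fails to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the patterns are illustrative, not from the question.
class BackrefDemo {
    // Returns true if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Group 1 participates and captures "", so \1 matches "".
        System.out.println(fullMatch("(q?)b\\1", "b"));   // true
        // Group 1 is skipped; \1 refers to an unset group and fails.
        System.out.println(fullMatch("(q)?b\\1", "b"));   // false
        // When the group really captures "q", both patterns match.
        System.out.println(fullMatch("(q?)b\\1", "qbq")); // true
        System.out.println(fullMatch("(q)?b\\1", "qbq")); // true
    }
}
```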


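Thanks for your answer. I would still like to understand why the group count is 1. My understanding (from the documentation and other experiments) is that a group count of 1 should mean that a single numbered group was found, and hence start(1) should be > -1. - user358795
The group count is obtained purely by counting the grouping brackets, and your pattern has exactly one: (.*). This is group 1. Whether any specific group matches on a given input is irrelevant for that definition. E.g., in the pattern "(Yes)|(No)", there are two groups ("(Yes)" is group 1, "(No)" is group 2), but only one of them can match for any given input. - Christian Semrau
1
So you are saying that when the documentation says "Returns the number of capturing groups in this matcher's pattern", it means the count in the expression, even if nothing matched? In that case, why does the call to find() return true? Or, to put it another way, how do I determine whether any groups matched and, if so, how many? - user358795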

5

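To summarize:

1) The two patterns give different results because of operator precedence rules.

  • (?:Yes|No)(.*)End matches (Yes or No) followed by .*End
  • (?:Yes)|(?:No)(.*)End matches (Yes) or (No followed by .*End)

2) The second pattern gives a group count of 1 but a start and end of -1 because of the (not necessarily intuitive) meanings of the results returned by the Matcher method calls.

  • Matcher.find() returns true if a match was found. In your case, the match was on the (?:Yes) part of the pattern.
  • Matcher.groupCount() returns the number of capturing groups in the pattern, regardless of whether the capturing groups actually participated in the match. In your case, only the non-capturing (?:Yes) part participated in the match, but the capturing (.*) group is still part of the pattern, so the group count is 1.
  • Matcher.start(n) and Matcher.end(n) return the start and end index of the subsequence matched by the nth capturing group. In your case, although an overall match was found, the (.*) capturing group did not participate in the match and so captured no subsequence, hence the -1 results.

3) (A question asked in the comments.) To determine how many capturing groups actually captured a subsequence, iterate Matcher.start(n) for n from 0 to Matcher.groupCount(), counting the number of results that are not -1. (Note that Matcher.start(0) is the capturing group representing the whole pattern, which you may want to exclude.)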
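The counting loop described in 3) might look like the sketch below. It is not from the original answer: the class name GroupCounter and the method participatingGroups are hypothetical, and the loop starts at 1 to leave out group 0. The Matcher must already hold a successful match, since start(n) throws IllegalStateException otherwise.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, shown only to illustrate the counting loop.
class GroupCounter {
    // Counts capturing groups 1..groupCount() that took part in the most
    // recent successful match. Group 0 (the whole match) is not counted.
    static int participatingGroups(Matcher m) {
        int n = 0;
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.start(i) != -1) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        String[] regexes = {"(?:Yes|No)(.*)End", "(?:Yes)|(?:No)(.*)End"};
        for (String regex : regexes) {
            Matcher m = Pattern.compile(regex).matcher(input);
            if (m.find()) {
                // First pattern:  declared=1 participating=1
                // Second pattern: declared=1 participating=0
                System.out.println(regex + " declared=" + m.groupCount()
                        + " participating=" + participatingGroups(m));
            }
        }
    }
}
```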


3

Because of the precedence of the "|" operator in the pattern, the second pattern is equivalent to:

(?:Yes)|(?:(?:No)(.*)End)

What you need is:

(?:(?:Yes)|(?:No))(.*)End
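A quick way to check these equivalences is to print the overall match and group 1 for each variant. This is only a sketch: the class PrecedenceDemo and its describe method are made-up names, and the input is the test string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the pattern variants side by side.
class PrecedenceDemo {
    // Describes the first match of regex in input: whole match and group 1.
    // Every regex passed in here is assumed to declare a group 1.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        // group(1) is null when group 1 did not take part in the match.
        return "match=[" + m.group() + "] group1=[" + m.group(1) + "]";
    }

    public static void main(String[] args) {
        String input = "After Yes is group 1 End";
        // Original second pattern: only "Yes" is matched.
        System.out.println(describe("(?:Yes)|(?:No)(.*)End", input));
        // match=[Yes] group1=[null]
        // Same pattern with the implied grouping written out.
        System.out.println(describe("(?:Yes)|(?:(?:No)(.*)End)", input));
        // match=[Yes] group1=[null]
        // Alternation confined to the prefix, as in the first pattern.
        System.out.println(describe("(?:(?:Yes)|(?:No))(.*)End", input));
        // match=[Yes is group 1 End] group1=[ is group 1 ]
    }
}
```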

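You are wrong about groupCount, since the Javadoc clearly explains: "Group zero denotes the entire pattern by convention. It is not included in this count." It is unintuitive, though. - Mark Peters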

1

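When working with regular expressions, it is important to remember that there is an implicit AND operator at work. This can be seen in the JavaDoc for java.util.regex.Pattern, in the section covering logical operators:

Logical operators
XY X followed by Y
X|Y Either X or Y
(X) X, as a capturing group

This AND takes precedence over the OR in the second pattern. The second pattern is equivalent to
(?:Yes)|(?:(?:No)(.*)End).
To make it equivalent to the first pattern, it must be changed to
(?:(?:Yes)|(?:No))(.*)End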
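
The binding of (.*)End to the No branch also shows up when the second pattern is given an input where that branch is the one taken. The sketch below uses a made-up class BranchDemo, a made-up helper groupOneSpan, and a variant of the question's test string with "No" in place of "Yes"; none of these appear in the original thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: same pattern, two inputs, different branches taken.
class BranchDemo {
    // Returns {start(1), end(1)} for the first match, or null if no match.
    static int[] groupOneSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] {m.start(1), m.end(1)};
    }

    public static void main(String[] args) {
        String regex = "(?:Yes)|(?:No)(.*)End";

        // "Yes" branch taken: group 1 belongs to the other branch.
        int[] yes = groupOneSpan(regex, "After Yes is group 1 End");
        System.out.println(yes[0] + ".." + yes[1]); // -1..-1

        // "No" branch taken: group 1 captures " is group 1 ".
        int[] no = groupOneSpan(regex, "After No is group 1 End");
        System.out.println(no[0] + ".." + no[1]);   // 8..20
    }
}
```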

