正则表达式字符类的双重否定中存在漏洞吗?

22

更新:在Java 11中,下面描述的Bug似乎已经被修复了。

(可能甚至在之前就已经修复了,但我不知道确切的版本。关于类似问题的Bug报告链接在nhahtdh的回答中提到建议使用Java 9)。


TL;DR(修复之前):
为什么[^\\D2][^[^0-9]2][^2[^0-9]]在Java中得到不同的结果?


用于测试的代码。现在可以跳过它。

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

假设我需要一个正则表达式,它可以接受以下字符:

  • 非数字,
  • 2除外。

因此,这样的正则表达式应该表示除0134,...,9之外的每个字符。 我可以用至少两种方式编写它,这将是不是数字的所有内容与2的总和:

  • [[^0-9]2]
  • [\\D2]

这两个正则表达式都按预期工作。

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true

现在假设我想要反转接受的字符。 (也就是说,我想要接受所有数字,除了2) 我可以创建一个明确包含所有接受字符的正则表达式,如:

  • [013-9]

或者尝试通过将之前描述的两个正则表达式包装在另一个[^...]中来否定它们,如:

  • [^\\D2]
  • [^[^0-9]2]
    甚至
  • [^2[^0-9]]

但令我惊讶的是,只有前两个版本按预期工作。

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true 
所以我的问题是,为什么 [^[^0-9]2][^2[^0-9]] 的行为与 [^\D2] 不同?我能否以某种方式更正这些正则表达式,以便能够在其中使用 [^0-9]

1
只是提醒您,[^[^0-9]][0-9]不同。 - anubhava
@anubhava 谢谢。这非常有趣。从我看到的,它的行为与 [^0-9] 相同。你能解释一下吗?从我在帖子答案中读到的,它应该更像是 not nothing 的并集(可能代表所有具有 [^0-9] 的字符)。但在这种情况下,“Everything UNION Anything” 仍然是“Everything”。在这种情况下,似乎使用了交集而不是并集,这有点令人困惑。 - Pshemo
抱歉,我得去开会了。在嵌套的 [ 和 ] 中,我认为没有否定操作。因此,[^[^0-9]] 被解释为 ^[^0-9] 的并集,实际上等同于 [^0-9] - anubhava
如果[^[^0-9]]^[^0-9]的并集,那么^应该被[^[^0-9^]]匹配,但是"^".matches("[^[^0-9^]]")返回false :/(除非我误解了你的意思)。 - Pshemo
不太确定,因为 [^3[^2]] 匹配任何东西,即任何数字或非数字。 - anubhava
显示剩余7条评论
2个回答

17
Oracle的Pattern类中的字符类解析代码存在一些奇怪的巫术。如果您从Oracle网站下载JRE/JDK或者使用OpenJDK,则会自带该类。我没有检查其他JVM(尤其是GNU Classpath)的实现如何解析所提出的正则表达式。
从这一点开始,对于Pattern类及其内部工作的任何引用都严格限制为Oracle的实现(参考实现)。
阅读和理解Pattern类如何解析嵌套否定需要一些时间。然而,我已经编写了一个程序1,使用Reflection APIPattern对象中提取信息以查看编译结果。下面的输出是在Java HotSpot Client VM版本1.7.0_51上运行我的程序的结果。

1: 目前,这个程序是一团糟。当我完成并重构它后,我将在此帖子中更新一个链接。

[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

这里没有什么惊奇的地方。

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

下面的2个案例编译成的程序与 [^0-9] 相同,这是很难理解的。
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

在上述两种情况中,正如问题所述,没有什么奇怪的。
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

这两个案例都符合问题中所述的预期结果。然而,请注意引擎如何对第一个字符类(\D)取补集,并将其应用于由剩余字符组成的字符类的差集。
[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

正如Keppil在评论中确认的那样,上面的输出显示所有3个正则表达式都编译成了相同的程序!

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

我们使用UNION(NOT(2), NOT(0-9))代替NOT(UNION(2, NOT(0-9))),这等同于NOT(2),而不是0-13-9

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

正则表达式[^2[^[^0-9]]]由于相同的错误编译成与[^2[^0-9]]相同的程序。
存在一个未解决的Bug,似乎具有相同的性质:JDK-6609854

解释

初步

在继续阅读之前,以下是Pattern类的实现细节,需要了解:

  • Pattern类将一个String编译成一系列节点,每个节点负责一个小而明确定义的任务,并将工作委托给链中的下一个节点。 Node类是所有节点的基类。
  • CharProperty类是所有字符类相关Node的基类。
  • BitClass类是CharProperty类的子类,它使用boolean[]数组加速匹配Latin-1字符(代码点≤255)。它有一个add方法,在编译期间允许添加字符。
  • CharProperty.complementPattern.unionPattern.intersection是对应集合操作的方法。它们的作用是不言自明的。
  • Pattern.setDifferenceasymmetric set difference

乍一看解析字符类

在查看负责解析字符类的 CharProperty clazz(boolean consume) 方法的完整代码之前,让我们先看一个极其简化的代码版本,以了解代码的流程:

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}

该代码基本上读取输入(将输入的String转换为以空终止int[]码点),直到遇到]或字符串结尾(未关闭字符类)为止。
该代码有点混乱,continuebreak混在switch块中。但是,只要你意识到continue属于外部的for循环,break属于switch块,那么这段代码就很容易理解:
  • continue结尾的情况永远不会执行switch语句后的代码。
  • break结尾的情况可能会执行switch语句后的代码(如果它还没有return)。
通过上述观察,我们可以看到,每当找到一个非特殊字符并应包含在字符类中时,我们将执行switch语句后面的代码,在其中node = range(bits);是第一条语句。
如果你查看源代码,方法CharProperty range(BitClass bits)解析“字符类中的单个字符或字符范围”。该方法可以返回传入的相同BitClass对象(添加了新字符),也可以返回CharProperty类的新实例。

详细信息

接下来,让我们看一下完整版本的代码(省略了解析字符类交集&&的部分):
private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}

查看 switch 语句中的 case '[': 代码和语句后面的代码:

  • node 变量存储解析 单元(独立字符、字符范围、简写字符类、POSIX/Unicode 字符类或嵌套字符类)的结果。
  • prev 变量存储到目前为止的编译结果,并且总是在我们编译 node 中的一个 单元 后进行更新。

由于本地变量 boolean include,记录字符类是否被否定,从未传递给任何方法调用,因此它只能在此方法中起作用。 而且,include 唯一被读取和处理的地方是在 switch 语句之后。

正在建设中的文章


那很有趣。谢谢您的输入。 - Pshemo
2
@Pshemo:我写信给core-lib-dev,结果发现这个问题已经在2011年讨论过,并且已经提出了一个补丁。我在我写给他们的邮件中更详细地解释了当前的行为。我应该更新这篇帖子并加入这些信息吗? - nhahtdh
我本来想早点接受你的答案,但不幸的是,我看到这些双重否定的情况是如何处理的,却没有解释“为什么要这样处理”,感觉这个答案还不够完整。有了这部分内容,我可能会更好地理解它。谢谢你的时间。 - Pshemo
1
@Pshemo:你可以慢慢来(我也会慢慢来,因为我现在有点累)。 - nhahtdh
2
@user1803551:这里的程序 https://github.com/nhahtdh/pattern-dissector (它比我的当前代码落后很多,但应该会输出类似于上面的内容)。然而,我没有时间完成其余的分析。需要6-8小时的专注工作(这是半开玩笑,但也有一半是真的)。 - nhahtdh
显示剩余2条评论

15
根据JavaDoc页面,嵌套类产生两个类的联合,使用该符号无法创建交集:

要创建联合,只需将一个类嵌套在另一个类中,例如[0-4[6-8]]。这个特定的联合创建一个匹配数字0、1、2、3、4、6、7和8的单个字符类。

要创建交集,您必须使用&&:

要创建仅匹配所有嵌套类中公共字符的单个字符类,请使用&&,如[0-9&&[345]]。该特定交集创建一个仅匹配两个字符类中共同数字的单个字符类:3、4和5。

你问题的最后一部分对我来说仍然是个谜。[^2][^0-9]的联合确实应该是[^2],因此[^2[^0-9]]的行为符合预期。[^[^0-9]2]的行为像[^0-9]确实很奇怪。


谢谢你的回答。这也是我一开始想到的。看起来在[^2[^0-9]]的情况下,首先创建了[^2],然后正则表达式引擎使用联合将其与[^0-9]组合,因此它不会改变任何东西,因为这两个类的总和是[^2][^0-9][^2]的子集)。让我困扰的是为什么[^[^0-9]2]的行为与[^0-9]相同,但与[^2]不同? - Pshemo
@Pshemo:稍微更新了一下答案。我在想,javadoc中的所有示例都将嵌套类作为最后一个元素。如果不遵循这个约定,行为会有点未定义吗? - Keppil
这相当令人费解。也许如果将[]放在外部否定字符类的第一个元素位置,而不是使用联合交集运算符会有所帮助。德摩根定理可能与此有关,但我不确定如何将其与此情况联系起来。 - Pshemo
更让我困惑的是,[^[^0-9]2][^[^[^0-9]]2][^[^[^[^0-9]]]2]都产生相同的结果。我尝试查看代码,但它并不容易理解。 - Keppil
@Keppil 多层嵌套是否被支持? - Martin Ender
@m.buettner:至少应该是2,因为文档中有这样的例子。然而,文档并没有说明关于嵌套的否定字符类的任何内容。 - nhahtdh

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接