正则表达式字符类的双重否定中存在漏洞吗？

Question

正则表达式字符类的双重否定中存在漏洞吗？

javaregex

22

更新：在Java 11中，下面描述的Bug似乎已经被修复了。

（可能甚至在之前就已经修复了，但我不知道确切的版本。关于类似问题的Bug报告链接在nhahtdh的回答中提到建议使用Java 9）。

TL;DR（修复之前）：
为什么[^\\D2]、[^[^0-9]2]、[^2[^0-9]]在Java中得到不同的结果？

用于测试的代码。现在可以跳过它。

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

假设我需要一个正则表达式，它可以接受以下字符：

非数字，
但2除外。

因此，这样的正则表达式应该表示除0，1，3，4，...，9之外的每个字符。我可以用至少两种方式编写它，这将是不是数字的所有内容与2的总和：

[[^0-9]2]
[\\D2]

这两个正则表达式都按预期工作。

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true

现在假设我想要反转接受的字符。 （也就是说，我想要接受所有数字，除了2）我可以创建一个明确包含所有接受字符的正则表达式，如：

[013-9]

或者尝试通过将之前描述的两个正则表达式包装在另一个[^...]中来否定它们，如：

[^\\D2]
[^[^0-9]2]
甚至
[^2[^0-9]]

但令我惊讶的是，只有前两个版本按预期工作。

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true

所以我的问题是，为什么 [^[^0-9]2] 或 [^2[^0-9]] 的行为与 [^\D2] 不同？我能否以某种方式更正这些正则表达式，以便能够在其中使用 [^0-9]？

- Pshemo

1

只是提醒您，[^[^0-9]]与[0-9]不同。 - anubhava

@anubhava 谢谢。这非常有趣。从我看到的，它的行为与 [^0-9] 相同。你能解释一下吗？从我在帖子答案中读到的，它应该更像是 not nothing 的并集（可能代表所有具有 [^0-9] 的字符）。但在这种情况下，“Everything UNION Anything” 仍然是“Everything”。在这种情况下，似乎使用了交集而不是并集，这有点令人困惑。 - Pshemo

抱歉，我得去开会了。在嵌套的 [ 和 ] 中，我认为没有否定操作。因此，[^[^0-9]] 被解释为 ^ 和 [^0-9] 的并集，实际上等同于 [^0-9]。 - anubhava

如果[^[^0-9]]是^和[^0-9]的并集，那么^应该被[^[^0-9^]]匹配，但是"^".matches("[^[^0-9^]]")返回false :/（除非我误解了你的意思）。 - Pshemo

不太确定，因为 [^3[^2]] 匹配任何东西，即任何数字或非数字。 - anubhava

显示剩余7条评论

2个回答

15

根据JavaDoc页面，嵌套类产生两个类的联合，使用该符号无法创建交集:

要创建联合，只需将一个类嵌套在另一个类中，例如[0-4[6-8]]。这个特定的联合创建一个匹配数字0、1、2、3、4、6、7和8的单个字符类。

要创建交集，您必须使用&&:

要创建仅匹配所有嵌套类中公共字符的单个字符类，请使用&&，如[0-9&&[345]]。该特定交集创建一个仅匹配两个字符类中共同数字的单个字符类：3、4和5。

你问题的最后一部分对我来说仍然是个谜。[^2]和[^0-9]的联合确实应该是[^2]，因此[^2[^0-9]]的行为符合预期。[^[^0-9]2]的行为像[^0-9]确实很奇怪。

- Keppil

谢谢你的回答。这也是我一开始想到的。看起来在[^2[^0-9]]的情况下，首先创建了[^2]，然后正则表达式引擎使用联合将其与[^0-9]组合，因此它不会改变任何东西，因为这两个类的总和是[^2]（[^0-9]是[^2]的子集）。让我困扰的是为什么[^[^0-9]2]的行为与[^0-9]相同，但与[^2]不同？ - Pshemo

@Pshemo：稍微更新了一下答案。我在想，javadoc中的所有示例都将嵌套类作为最后一个元素。如果不遵循这个约定，行为会有点未定义吗？ - Keppil

这相当令人费解。也许如果将[]放在外部否定字符类的第一个元素位置，而不是使用联合交集运算符会有所帮助。德摩根定理可能与此有关，但我不确定如何将其与此情况联系起来。 - Pshemo

更让我困惑的是，[^[^0-9]2]、[^[^[^0-9]]2]和[^[^[^[^0-9]]]2]都产生相同的结果。我尝试查看代码，但它并不容易理解。 - Keppil

@Keppil 多层嵌套是否被支持？ - Martin Ender

@m.buettner：至少应该是2，因为文档中有这样的例子。然而，文档并没有说明关于嵌套的否定字符类的任何内容。 - nhahtdh

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- nhahtdh · Accepted Answer

Oracle的Pattern类中的字符类解析代码存在一些奇怪的巫术。如果您从Oracle网站下载JRE/JDK或者使用OpenJDK，则会自带该类。我没有检查其他JVM(尤其是GNU Classpath)的实现如何解析所提出的正则表达式。

从这一点开始，对于Pattern类及其内部工作的任何引用都严格限制为Oracle的实现(参考实现)。

阅读和理解Pattern类如何解析嵌套否定需要一些时间。然而，我已经编写了一个程序¹，使用Reflection API从Pattern对象中提取信息以查看编译结果。下面的输出是在Java HotSpot Client VM版本1.7.0_51上运行我的程序的结果。

^{1: 目前，这个程序是一团糟。当我完成并重构它后，我将在此帖子中更新一个链接。}

[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

这里没有什么惊奇的地方。

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

下面的2个案例编译成的程序与 [^0-9] 相同，这是很难理解的。

[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

在上述两种情况中，正如问题所述，没有什么奇怪的。

[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

这两个案例都符合问题中所述的预期结果。然而，请注意引擎如何对第一个字符类(\D)取补集，并将其应用于由剩余字符组成的字符类的差集。

[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

正如Keppil在评论中确认的那样，上面的输出显示所有3个正则表达式都编译成了相同的程序！

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

我们使用UNION(NOT(2), NOT(0-9))代替NOT(UNION(2, NOT(0-9)))，这等同于NOT(2)，而不是0-13-9。

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

正则表达式[^2[^[^0-9]]]由于相同的错误编译成与[^2[^0-9]]相同的程序。

存在一个未解决的Bug，似乎具有相同的性质：JDK-6609854。

解释

初步

在继续阅读之前，以下是Pattern类的实现细节，需要了解：

Pattern类将一个String编译成一系列节点，每个节点负责一个小而明确定义的任务，并将工作委托给链中的下一个节点。 Node类是所有节点的基类。
CharProperty类是所有字符类相关Node的基类。
BitClass类是CharProperty类的子类，它使用boolean[]数组加速匹配Latin-1字符（代码点≤255）。它有一个add方法，在编译期间允许添加字符。
CharProperty.complement、Pattern.union、Pattern.intersection是对应集合操作的方法。它们的作用是不言自明的。
Pattern.setDifference是asymmetric set difference。

乍一看解析字符类

在查看负责解析字符类的 CharProperty clazz(boolean consume) 方法的完整代码之前，让我们先看一个极其简化的代码版本，以了解代码的流程:

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}

该代码基本上读取输入（将输入的String转换为以空终止int[]码点），直到遇到]或字符串结尾（未关闭字符类）为止。

该代码有点混乱，continue和break混在switch块中。但是，只要你意识到continue属于外部的for循环，break属于switch块，那么这段代码就很容易理解：

以continue结尾的情况永远不会执行switch语句后的代码。
以break结尾的情况可能会执行switch语句后的代码（如果它还没有return）。

通过上述观察，我们可以看到，每当找到一个非特殊字符并应包含在字符类中时，我们将执行switch语句后面的代码，在其中node = range(bits);是第一条语句。

如果你查看源代码，方法CharProperty range(BitClass bits)解析“字符类中的单个字符或字符范围”。该方法可以返回传入的相同BitClass对象（添加了新字符），也可以返回CharProperty类的新实例。

详细信息

接下来，让我们看一下完整版本的代码（省略了解析字符类交集&&的部分）：

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}

查看 switch 语句中的 case '[': 代码和语句后面的代码：

node 变量存储解析单元（独立字符、字符范围、简写字符类、POSIX/Unicode 字符类或嵌套字符类）的结果。
prev 变量存储到目前为止的编译结果，并且总是在我们编译 node 中的一个单元后进行更新。

由于本地变量 boolean include，记录字符类是否被否定，从未传递给任何方法调用，因此它只能在此方法中起作用。而且，include 唯一被读取和处理的地方是在 switch 语句之后。

正则表达式字符类的双重否定中存在漏洞吗？

更新：在Java 11中，下面描述的Bug似乎已经被修复了。

（可能甚至在之前就已经修复了，但我不知道确切的版本。关于类似问题的Bug报告链接在nhahtdh的回答中提到建议使用Java 9）。

解释

初步

乍一看解析字符类

详细信息

正在建设中的文章