Oracle的
Pattern
类中的字符类解析代码存在一些奇怪的巫术。如果您从Oracle网站下载JRE/JDK或者使用OpenJDK,则会自带该类。我没有检查其他JVM(尤其是
GNU Classpath)的实现如何解析所提出的正则表达式。
从这一点开始,对于
Pattern
类及其内部工作的任何引用都严格限制为Oracle的实现(参考实现)。
阅读和理解
Pattern
类如何解析嵌套否定需要一些时间。然而,我已经编写了一个程序
1,使用
Reflection API从
Pattern
对象中提取信息以查看编译结果。下面的输出是在Java HotSpot Client VM版本1.7.0_51上运行我的程序的结果。
1: 目前,这个程序是一团糟。当我完成并重构它后,我将在此帖子中更新一个链接。
[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
这里没有什么惊奇的地方。
[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
下面的2个案例编译成的程序与
[^0-9]
相同,这是很难理解的。
[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
在上述两种情况中,正如问题所述,没有什么奇怪的。
[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
[U+0030][U+0031]
01
Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Ctype. Match POSIX character class DIGIT (US-ASCII)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
这两个案例都符合问题中所述的预期结果。然而,请注意引擎如何对第一个字符类(
\D
)取补集,并将其应用于由剩余字符组成的字符类的差集。
[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
LastNode
Node. Accept match
正如Keppil在评论中确认的那样,上面的输出显示所有3个正则表达式都编译成了相同的程序!
[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
我们使用UNION(NOT(2), NOT(0-9))
代替NOT(UNION(2, NOT(0-9)))
,这等同于NOT(2)
,而不是0-13-9
。
[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
[U+0032]
2
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match
正则表达式
[^2[^[^0-9]]]
由于相同的错误编译成与
[^2[^0-9]]
相同的程序。
存在一个未解决的Bug,似乎具有相同的性质:
JDK-6609854。
解释
初步
在继续阅读之前,以下是Pattern
类的实现细节,需要了解:
Pattern
类将一个String
编译成一系列节点,每个节点负责一个小而明确定义的任务,并将工作委托给链中的下一个节点。 Node
类是所有节点的基类。
CharProperty
类是所有字符类相关Node
的基类。
BitClass
类是CharProperty
类的子类,它使用boolean[]
数组加速匹配Latin-1字符(代码点≤255)。它有一个add
方法,在编译期间允许添加字符。
CharProperty.complement
、Pattern.union
、Pattern.intersection
是对应集合操作的方法。它们的作用是不言自明的。
Pattern.setDifference
是asymmetric set difference。
乍一看解析字符类
在查看负责解析字符类的 CharProperty clazz(boolean consume)
方法的完整代码之前,让我们先看一个极其简化的代码版本,以了解代码的流程:
private CharProperty clazz(boolean consume) {
BitClass bits = new BitClass();
int ch = next();
for (;;) {
switch (ch) {
case '^':
if (firstInClass) {
ch = next();
continue;
} else {
break;
}
case '[':
ch = peek();
continue;
case '&':
continue;
case 0:
break;
case ']':
break;
default:
break;
}
node = range(bits);
ch = peek();
}
}
该代码基本上读取输入(将输入的
String
转换为以
空终止int[]
码点),直到遇到
]
或字符串结尾(未关闭字符类)为止。
该代码有点混乱,
continue
和
break
混在
switch
块中。但是,只要你意识到
continue
属于外部的
for
循环,
break
属于
switch
块,那么这段代码就很容易理解:
- 以
continue
结尾的情况永远不会执行switch
语句后的代码。
- 以
break
结尾的情况可能会执行switch
语句后的代码(如果它还没有return
)。
通过上述观察,我们可以看到,每当
找到一个非特殊字符并应包含在字符类中时,我们将执行
switch
语句后面的代码,在其中
node = range(bits);
是第一条语句。
如果你查看
源代码,方法
CharProperty range(BitClass bits)
解析“字符类中的单个字符或字符范围”。该方法可以返回传入的相同
BitClass
对象(添加了新字符),也可以返回
CharProperty
类的新实例。
详细信息
接下来,让我们看一下完整版本的代码(省略了解析字符类交集
&&
的部分):
private CharProperty clazz(boolean consume) {
CharProperty prev = null;
CharProperty node = null;
BitClass bits = new BitClass();
boolean include = true;
boolean firstInClass = true;
int ch = next();
for (;;) {
switch (ch) {
case '^':
if (firstInClass) {
if (temp[cursor-1] != '[')
break;
ch = next();
include = !include;
continue;
} else {
break;
}
case '[':
firstInClass = false;
node = clazz(true);
if (prev == null)
prev = node;
else
prev = union(prev, node);
ch = peek();
continue;
case '&':
continue;
case 0:
firstInClass = false;
if (cursor >= patternLength)
throw error("Unclosed character class");
break;
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
} else {
if (prev == null) {
prev = node.complement();
} else {
if (prev != node)
prev = setDifference(prev, node);
}
}
ch = peek();
}
}
查看 switch
语句中的 case '[':
代码和语句后面的代码:
node
变量存储解析 单元(独立字符、字符范围、简写字符类、POSIX/Unicode 字符类或嵌套字符类)的结果。
prev
变量存储到目前为止的编译结果,并且总是在我们编译 node
中的一个 单元 后进行更新。
由于本地变量 boolean include
,记录字符类是否被否定,从未传递给任何方法调用,因此它只能在此方法中起作用。 而且,include
唯一被读取和处理的地方是在 switch
语句之后。
正在建设中的文章
[^[^0-9]]
与[0-9]
不同。 - anubhava[^0-9]
相同。你能解释一下吗?从我在帖子答案中读到的,它应该更像是not nothing
的并集(可能代表所有具有[^0-9]
的字符)。但在这种情况下,“Everything UNION Anything” 仍然是“Everything”。在这种情况下,似乎使用了交集而不是并集,这有点令人困惑。 - Pshemo[ 和 ]
中,我认为没有否定操作。因此,[^[^0-9]]
被解释为^
和[^0-9]
的并集,实际上等同于[^0-9]
。 - anubhava[^[^0-9]]
是^
和[^0-9]
的并集,那么^
应该被[^[^0-9^]]
匹配,但是"^".matches("[^[^0-9^]]")
返回false :/(除非我误解了你的意思)。 - Pshemo[^3[^2]]
匹配任何东西,即任何数字或非数字。 - anubhava