What does \X match when used in a regular expression?

Question

regex unicode

13

According to http://www.regular-expressions.info,

You can consider \X the Unicode version of the dot in regex engines that use plain ASCII.

Does this mean that it will match any possible Unicode code point?

- federico-t

2 Answers

7

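From the Perl regex manual:

This matches a Unicode extended grapheme cluster.

\X matches quite well what normal (non-Unicode-programmer) usage would consider a single character. As an example, consider a G with some sort of diacritic mark, such as an arrow. There is no such single character in Unicode, but one can be composed by using a G followed by a Unicode "COMBINING UPWARDS ARROW BELOW", and would be displayed by Unicode-aware software as if it were a single character.

Mnemonic: eXtended Unicode character.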
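
To make the manual's G-with-arrow example concrete, here is a small Python sketch that only inspects the code points involved. It uses the standard library's `unicodedata` module; the variable names are illustrative. Python's built-in `re` module has no `\X`, so nothing here runs the escape itself.

```python
import unicodedata

# "G" followed by U+034E COMBINING UPWARDS ARROW BELOW
g_arrow = "G\u034e"

# List each code point with its general category and Unicode name.
for ch in g_arrow:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# NFC normalization composes a sequence only when a precomposed character
# exists. None exists for this pair, so the string keeps two code points.
composed = unicodedata.normalize("NFC", g_arrow)
print(len(composed))
```

The string is two code points long however it is normalized, yet a reader sees one character. That visible unit is what `\X` is meant to match.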

And from the PCRE man pages (2012):

PCRE implements a simpler version of \X than Perl, which changed to make \X match what Unicode calls an "extended grapheme cluster". This is more complicated than an extended Unicode sequence, which is what PCRE matches.

[...]

\X an extended Unicode sequence

[...]

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character. None of them have codepoints less than 256, so in 8-bit non-UTF-8 mode \X matches any one character.

Note that recent versions of Perl have changed \X to match what Unicode calls an "extended grapheme cluster", which has a more complicated definition.
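
The older definition `(?>\PM\pM*)` can be emulated by hand in Python. The sketch below assumes that a general category starting with "M" (as reported by `unicodedata.category`) stands in for the "mark" property; the function names are made up for illustration and are not part of PCRE or Perl.

```python
import unicodedata


def is_mark(ch):
    # Approximates \pM: general categories Mn, Mc and Me.
    return unicodedata.category(ch).startswith("M")


def legacy_x_findall(text):
    """Emulate scanning text with (?>\\PM\\pM*), the pre-8.31 PCRE meaning of \\X."""
    clusters = []
    i, n = 0, len(text)
    while i < n:
        if is_mark(text[i]):
            # \PM cannot match a mark, so a regex scan would skip this position.
            i += 1
            continue
        j = i + 1
        # \pM*: greedily take every mark that follows the base character.
        while j < n and is_mark(text[j]):
            j += 1
        clusters.append(text[i:j])
        i = j
    return clusters


print(legacy_x_findall("a\u0300bc"))
```

Note that a mark with no base character in front of it is not matched at all under this definition, which is one of the places where it differs from extended grapheme clusters.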

A later version of the PCRE manual (2015):

Extended grapheme clusters

The \X escape matches any number of Unicode characters that form an "extended grapheme cluster", and treats the sequence as an atomic group (see below). Up to and including release 8.31, PCRE matched an earlier, simpler definition that was equivalent to
(?>\PM\pM*)
That is, it matched a character without the "mark" property, followed by zero or more characters with the "mark" property. Characters with the "mark" property are typically non-spacing accents that affect the preceding character.

This simple definition was extended in Unicode to include more complicated kinds of composite character by giving each character a grapheme breaking property, and creating rules that use these properties to define the boundaries of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches one of these clusters.

\X always matches at least one character. Then it decides whether to add additional characters according to the following rules for ending a cluster:

End at the end of the subject string.

Do not end between CR and LF; otherwise end after any control character.

Do not break Hangul (a Korean script) syllable sequences. Hangul characters are of five types: L, V, T, LV, and LVT. An L character may be followed by an L, V, LV, or LVT character; an LV or V character may be followed by a V or T character; an LVT or T character may be followed only by a T character.

Do not end before extending characters or spacing marks. Characters with the "mark" property always have the "extend" grapheme breaking property.

Do not end after prepend characters.

Otherwise, end the cluster.
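
The rules above can be sketched in Python with only the standard library. This is a rough approximation, not PCRE's implementation: "control" is taken to be general category Cc, "extending characters or spacing marks" are approximated by general category M, Hangul types are derived from code point ranges, and the prepend rule is left out because `unicodedata` does not expose grapheme break properties. All function names are hypothetical.

```python
import unicodedata

# Which Hangul types may follow each type without ending the cluster.
ALLOWED_AFTER = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_type(ch):
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"  # leading consonant jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"  # vowel jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"  # trailing consonant jamo
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


def keeps_together(prev, cur):
    if prev == "\r" and cur == "\n":
        return True  # do not end between CR and LF
    if unicodedata.category(prev) == "Cc":
        return False  # otherwise end after any control character
    if hangul_type(cur) in ALLOWED_AFTER.get(hangul_type(prev), ()):
        return True  # do not break Hangul syllable sequences
    if unicodedata.category(cur).startswith("M"):
        return True  # do not end before a mark (approximation of Extend)
    return False  # otherwise, end the cluster


def split_clusters(text):
    clusters = []
    for ch in text:
        if clusters and keeps_together(clusters[-1][-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(split_clusters("a\u0300b\r\nc"))
```

For real use, a library that implements the full Unicode text segmentation rules is the safer choice; the point here is only to show how the rules are checked in order at each boundary.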

- Qtax


- unwind · Accepted Answer

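The site's description is quite good:

\X matches a single Unicode grapheme, whether encoded as a single code point or as multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character". \X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.

So, the thing that makes it Unicode-aware is that it can match several code points when they combine to form a single visible "thing" (a grapheme).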
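
As a quick illustration of the two encodings of à mentioned in the quote, the following Python sketch compares them using the standard library's `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

decomposed = "a\u0300"   # U+0061 followed by U+0300 COMBINING GRAVE ACCENT
precomposed = "\u00e0"   # U+00E0 LATIN SMALL LETTER A WITH GRAVE

# The two strings differ in length and compare unequal as raw code points.
print(len(decomposed), len(precomposed), decomposed == precomposed)

# NFC normalization composes the pair into the single precomposed code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# A nonzero combining class marks U+0300 as a combining character.
print(unicodedata.combining("\u0300"))
```

Both strings render as the same grapheme, and an engine that supports `\X` is expected to match either one as a single unit, whereas a plain dot would see two code points in the decomposed form.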

For more details, see the Combining character page on Wikipedia, for instance on the U+0300 code point mentioned above.