What is the difference between simple case folding and full case folding in Python's regex module?

Question


python regex python-regex

3

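This is the module I'm asking about: https://pypi.org/project/regex/, Matthew Barnett's regex.

On the project description page, the difference in behaviour between V0 and V1 is stated as follows (note the lines about case folding, emphasized in bold in the original post):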

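Old vs new behaviour

In order to be compatible with the re module, this module has 2 behaviours:

- Version 0 behaviour (old behaviour, compatible with the re module):

  Please note that the re module's behaviour may change over time, and I'll endeavour to match that behaviour in version 0.

  - Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
  - Case-insensitive matches in Unicode use simple case-folding by default.

- Version 1 behaviour (new behaviour, possibly different from the re module):

  - Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
  - Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to regex.DEFAULT_VERSION.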

I tried a few examples myself but couldn't figure out what it does:

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("(?V0i)и")
>>> r
regex.Regex('(?V0i)и', flags=regex.I | regex.V0)
>>> r.search("И")
<regex.Match object; span=(0, 1), match='И'>
>>> regex.search("(?V0i)é", "É")
<regex.Match object; span=(0, 1), match='É'>
>>> regex.search("(?V0i)é", "E")
>>> regex.search("(?V1i)é", "E")

What is the difference between simple and full case folding? Can you provide an example where a (case-insensitive) regex matches something in V1 but not in V0?

- iBug

1

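Untested, but it probably follows the Unicode case folding table. Full case folding may replace some special characters with two characters, while simple case folding never does. Such characters include the lowercase and uppercase Latin letter sharp s. - Michael Butscher

@MichaelButscher Nice, it works. If you write it up as an answer, you can get a green checkmark. - iBug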

1 Answer


- Michael Butscher · Accepted Answer

It follows the Unicode case folding table. Excerpt:

# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.

[...]

# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
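The usage rule in the excerpt can be sketched in a few lines of standard-library Python. This is only an illustration, not how the regex module is implemented: `SAMPLE`, `build_table`, and `fold` are hypothetical names, and the sample holds just four lines in the table's format (one common entry for the letter A plus the sharp-s entries) rather than the whole file.

```python
# Sample lines in the CaseFolding.txt format "<code>; <status>; <mapping>; # <name>".
# Only a handful of entries are included for illustration; the real table is much larger.
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_table(text, statuses):
    """Map each source character to its folded string, keeping only the given statuses."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing "# <name>" comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


def fold(s, table):
    """Fold a string character by character; unmapped characters stay unchanged."""
    return "".join(table.get(ch, ch) for ch in s)


SIMPLE = build_table(SAMPLE, {"C", "S"})  # usage A: C + S
FULL = build_table(SAMPLE, {"C", "F"})    # usage B: C + F

for text in ("A", "\u00df", "\u1e9e"):  # A, ß, ẞ
    print(repr(text), repr(fold(text, SIMPLE)), repr(fold(text, FULL)))
```

With this sample, simple folding never changes a string's length, while full folding turns both sharp-s characters into the two-character string "ss". Python's built-in `str.casefold()` applies full case folding, so `"\u00df".casefold()` also yields "ss".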

The two foldings differ only for a few special characters, e.g. the lowercase and uppercase Latin sharp s:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

[...]

1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
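Given the entries above, a case-insensitive pattern `ss` is a natural candidate for something that V1 matches and V0 does not: under full folding ß becomes "ss", while under simple folding it stays a single character. The sketch below assumes the third-party regex package is installed (`pip install regex`); `matches` is a hypothetical helper, and the import guard is only there so the snippet degrades gracefully without the package.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # the comparison below is skipped without the package


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return regex.search(pattern, text) is not None


if regex is not None:
    # V0 uses simple case folding: ß has no C or S entry, so it folds to itself.
    v0 = matches("(?V0i)ss", "\u00df")
    # V1 uses full case folding: ß folds to "ss" (the F entry for 00DF).
    v1 = matches("(?V1i)ss", "\u00df")
    print("V0:", v0, "V1:", v1)
```

If the module behaves as its description states, only the V1 pattern should find a match. The examples in the question (`é` against `E`) fail in both versions because an accent is not a case difference, so neither folding maps one to the other.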