Match any Unicode letter?

Question

python regex character-properties

20

In .NET you can use \p{L} to match any letter. How can I do the same in Python? That is, I want to match any uppercase letter, lowercase letter, or accented letter.

- mpen

1

See: https://dev59.com/eHI-5IYBdhLWcg3weoTR - Jeff Mercado

2

You do know that 'é' isn't a unicode string in 2.x, right? (See: http://farmdev.com/talks/unicode/) - Ignacio Vazquez-Abrams

@Ignacio/Tim: Oh! Right, I forgot! Thanks :D It's a bit confusing because it doesn't throw any errors. - mpen

2 Answers

9

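The PyPI regex module supports the \p{L} Unicode property class and much more; see the "Unicode codepoint properties, including scripts and blocks" section of its documentation, and the full list at http://www.unicode.org/Public/UNIDATA/PropList.txt. Using the regex module is convenient because you get consistent results in any Python version (bear in mind that the Unicode standard keeps evolving, and the number of supported letters keeps growing).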

Install the library with pip install regex (or pip3 install regex) and use:

\p{L}        # To match any Unicode letter
\p{Lu}       # To match any uppercase Unicode letter
\p{Ll}       # To match any lowercase Unicode letter
\p{L}\p{M}*  # To match any Unicode letter and any amount of diacritics after it
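
The reason for the trailing \p{M}* in the last pattern can be seen with the standard library alone. The sketch below is only an illustration and does not use the regex module: it decomposes 'é' with unicodedata.normalize (the variable names are made up for this example) and shows that the accent becomes a separate code point in a Mark category, which \p{L} by itself would not cover.

```python
import unicodedata

# 'é' stored as one precomposed code point (U+00E9).
composed = "\u00e9"

# NFD normalization splits it into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

# General categories of the decomposed code points:
# 'Ll' is a lowercase letter, 'Mn' is a nonspacing mark.
categories = [unicodedata.category(ch) for ch in decomposed]

print(len(composed), len(decomposed))  # 1 2
print(categories)                      # ['Ll', 'Mn']
```

So text that may arrive in decomposed form needs the letter plus any following marks to be treated as one unit.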

Here are some usage examples:

import regex
text = r'Abc-++-Абв. It’s “Łąć”!'
# Removing letters:
print( regex.sub(r'\p{L}+', '', text) ) # => -++-. ’ “”!
# Extracting letter chunks:
print( regex.findall(r'\p{L}+', text) ) # => ['Abc', 'Абв', 'It', 's', 'Łąć']
# Removing all but letters:
print( regex.sub(r'\P{L}+', '', text) ) # => AbcАбвItsŁąć
# Removing all letters but ASCII letters:
print( regex.sub(r'[^\P{L}a-zA-Z]+', '', text) ) # => Abc-++-. It’s “”!


- Wiktor Stribiżew

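It is worth mentioning that the PyPI regex module also supports POSIX character classes, so you can use [:alpha:] (any letter), [:lower:] (any lowercase letter) and [:upper:] (any uppercase letter) inside a character class to match various letters. Note that these POSIX character classes can be negated just like shorthand character classes. For example, to match any character other than a letter, use [:^alpha:]. The last regex.sub pattern ([^\P{L}a-zA-Z]+) can be written as [^[:^alpha:]a-zA-Z]+. - Wiktor Stribiżew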

- Tim Pietzcker · Accepted Answer

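Python's re module doesn't support Unicode properties yet. However, you can compile your regex with the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters too.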

Since \w also matches digits, you need to subtract them from the character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
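
The session above is from Python 2, where the u prefix and re.U matter. As a rough sketch of the same idea under Python 3 (an assumption, since the answer does not show it), str patterns are Unicode-aware by default, so the flag can be omitted. The sample text is borrowed from the other answer with digits and an underscore appended, and the variable names are only illustrative.

```python
import re

# Sample text: letters from several scripts, punctuation, digits, underscore.
text = "Abc-++-Абв. It’s “Łąć”! 123_"

# [^\W\d_] means: a word character that is neither a digit nor an underscore.
letter_runs = re.findall(r"[^\W\d_]+", text)
print(letter_runs)  # ['Abc', 'Абв', 'It', 's', 'Łąć']

# Cross-check against str.isalpha(), which is documented as true for
# characters whose Unicode general category is one of the letter categories.
all_letters = all(ch.isalpha() for run in letter_runs for ch in run)
print(all_letters)  # True
```

If only a per-character test is needed rather than a pattern, str.isalpha() may be enough by itself.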