Matching Unicode characters in a Python regular expression

Question

python regex unicode non-ascii-characters character-properties

32

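I have read the other questions on Stack Overflow but still can't get this to work. Sorry if this has already been answered, but I couldn't find anything that solves my problem.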

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or anything more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters such as øæå? I'd like to be able to match them in both the tag group and the filename group above.

- Weholt

Make sure to normalize your strings, since different code point sequences can render with the same visual appearance. - janbrohl
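A minimal sketch of the normalization point, assuming Python 3 and NFC as the target form (the comment does not say which form to use). The names `TAG` and `match_tag` are made up for illustration. It shows that a decomposed "å" (an "a" followed by a combining ring) is not fully matched by `\w+` until the string is normalized:

```python
import re
import unicodedata

# Same visible text "påske", two different code point sequences.
composed = "p\u00e5ske"      # å as one precomposed code point (U+00E5)
decomposed = "pa\u030aske"   # 'a' followed by COMBINING RING ABOVE (U+030A)

TAG = re.compile(r"\w+")     # hypothetical helper pattern, not from the question


def match_tag(text):
    """Normalize to NFC, then require the whole string to be word characters."""
    return TAG.fullmatch(unicodedata.normalize("NFC", text))


raw_composed = TAG.fullmatch(composed)      # matches
raw_decomposed = TAG.fullmatch(decomposed)  # None: the combining mark is not \w
fixed = match_tag(decomposed)               # matches once normalized
```

Normalizing both the pattern input and any stored tags to the same form before matching or comparing avoids this kind of mismatch.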

3 Answers

13

You need to use the UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)

- R. Martinho Fernandes

4

Is this also needed in Python 3? - Kevin

2

@Kevin - In Python 3 you don't need the unicode flag. "Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns..." - https://docs.python.org/3/howto/regex.html - jeffhale

I don't understand, why do we need to pass re.UNICODE? (I'm using Python 3) - Charlie Parker

7

In Python 2, you need the re.UNICODE flag and the unicode string constructor.

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+

In the last case, the comma is a Chinese (fullwidth) comma, U+FF0C, which \w does not match.
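For comparison, a rough Python 3 equivalent of the session above. This is a sketch rather than part of the original answer: in Python 3 the literals are already `str`, so neither `unicode(...)` nor `re.UNICODE` is used, and the variable names `samples` and `results` are only illustrative.

```python
import re

# The same inputs as in the Python 2 session, written as plain str literals.
samples = [
    ",./hello-=+",
    ",./cześć-=+",
    ",./привет-=+",
    ",./你好-=+",
    ",./你好，世界-=+",  # contains a fullwidth comma (U+FF0C)
]

# \w is Unicode-aware by default for str patterns in Python 3.
results = [re.sub(r"\w+", "___", s) for s in samples]

for original, replaced in zip(samples, results):
    print(original, "->", replaced)
```

Each run of word characters collapses to `___`, and the fullwidth comma in the last sample is left in place, splitting the text into two runs.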

- 18446744073709551615

I don't understand, why do we need to pass re.UNICODE? (I'm using Python 3) - Charlie Parker

- Thomas · Accepted Answer

You need to specify the re.UNICODE flag, and enter your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

The u prefix is needed in Python 2. In Python 3 you should leave it out, since all strings are Unicode, and you can also drop the re.UNICODE flag.
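As an illustration of the Python 3 behaviour described here, the sketch below runs the question's pattern on a plain `str` with no flag, with the redundant `re.UNICODE`, and with `re.ASCII`, the Python 3 flag that restricts `\w` to ASCII and so brings back the failure from the question. The names `PATTERN` and `path` are just for this example.

```python
import re

PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'
path = '/by_tag/påske/øyfjell.jpg'

# No flag: str patterns are matched with Unicode semantics by default.
default_match = re.match(PATTERN, path)

# re.UNICODE is accepted for str patterns but changes nothing.
unicode_match = re.match(PATTERN, path, re.UNICODE)

# re.ASCII limits \w to [a-zA-Z0-9_], so å and ø no longer match.
ascii_match = re.match(PATTERN, path, re.ASCII)

print(default_match.groupdict())
print(ascii_match)
```

If a call like this returns `None` on Python 3, it is worth checking whether the input is `bytes` rather than `str`, or whether `re.ASCII` (or an inline `(?a)`) is set somewhere.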