Does Python support Unicode characters beyond the Basic Multilingual Plane?

Question

Here is a simple test. `repr` looks fine, but in Python 2.6 and 2.7, `len` and `x for x in` do not seem to split the Unicode text correctly:

In [1]: u"\U0002f920\U0002f921"
Out[1]: u'\U0002f920\U0002f921'

In [2]: [x for x in u"\U0002f920\U0002f921"]
Out[2]: [u'\ud87e', u'\udd20', u'\ud87e', u'\udd21']

The good news is that Python 3.3 does the Right Thing™.

Is there any hope for the Python 2.x series?

- Dima Tisnek

1 Answer


- Martijn Pieters · Accepted Answer

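Yes, provided you compiled your Python with wide Unicode support.

By default, Python is built with narrow Unicode support only. To enable wide support, configure the build with: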

./configure --enable-unicode=ucs4

You can verify which configuration is in use by testing sys.maxunicode:

import sys
# Parenthesised print with a single argument works on Python 2 and Python 3.
if sys.maxunicode == 0x10FFFF:
    print('Python built with UCS4 (wide unicode) support')
else:
    print('Python built with UCS2 (narrow unicode) support')
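
On a narrow build, sys.maxunicode is 0xFFFF, and any code point above that is stored as two UTF-16 code units (a surrogate pair). That is where the four items in the question's output come from. A minimal sketch of the arithmetic follows; the helper name `to_surrogate_pair` is made up for illustration and is not part of Python:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


for cp in (0x2F920, 0x2F921):
    high, low = to_surrogate_pair(cp)
    print("U+%05X -> %04x %04x" % (cp, high, low))
```

For the two characters in the question this gives d87e dd20 and d87e dd21, matching the `Out[2]` list.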

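A wide build uses UCS4 code units for every Unicode value, which doubles the memory used by Unicode strings compared with a narrow build. Python 3.3 switched to a variable-width representation (PEP 393): each string uses only as many bytes per character as are needed to represent all the characters in that value.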
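
The variable-width storage can be observed roughly as follows. This is only a sketch, assuming CPython 3.3 or later; `bytes_per_char` is a hypothetical helper, and `sys.getsizeof` reports implementation-specific sizes, so other interpreters may give different numbers:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage from the size difference of two strings.

    Subtracting the sizes cancels the fixed object header, leaving only
    the cost of the n extra characters.
    """
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


samples = [
    ("ASCII", "a"),
    ("BMP, outside Latin-1", "\u4e2d"),
    ("outside the BMP", "\U0002f920"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ the three samples should come out at 1, 2 and 4 bytes per character, since a string's width is chosen by its widest character.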

A quick demo shows that a wide build handles your sample Unicode string correctly:

$ python2.6
Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> [x for x in u'\U0002f920\U0002f921']
[u'\U0002f920', u'\U0002f921']
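
If rebuilding Python is not an option, one possible workaround on a narrow build is to join surrogate pairs by hand while iterating. The sketch below is not from the answer above; `iter_code_points` is a hypothetical helper, written so that it also runs on Python 3, where the sample string simply holds the same four code units explicitly:

```python
def iter_code_points(text):
    """Yield one substring per code point, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        is_high = 0xD800 <= unit <= 0xDBFF
        # A high surrogate followed by a low surrogate forms one code point.
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            yield text[i:i + 2]
            i += 2
        else:
            yield text[i]
            i += 1


# The four code units a narrow build stores for the question's string.
units = u'\ud87e\udd20\ud87e\udd21'
print(len(units))
print(len(list(iter_code_points(units))))
```

On a narrow build each yielded item is a two-unit string that displays as a single character, so counting the items gives the number of code points rather than the number of code units.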