How do I get the same Unicode string length in Python 2 and 3?

Python 2/3 is really frustrating... Consider this example, test.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
if sys.version_info[0] < 3:
  text_type = unicode
  binary_type = str
  def b(x):
    return x
  def u(x):
    return unicode(x)
else:
  text_type = str
  binary_type = bytes
  import codecs
  def b(x):
    return codecs.latin_1_encode(x)[0]
  def u(x):
    return x

tstr = " ▲ "

sys.stderr.write(tstr)
sys.stderr.write("\n")
sys.stderr.write(str(len(tstr)))
sys.stderr.write("\n")

Running it:

$ python2.7 test.py 
 ▲ 
5
$ python3.2 test.py 
 ▲ 
3
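
The two numbers are consistent with Python 2 treating the literal as a byte string (5 UTF-8 bytes) and Python 3 treating it as text (3 code points). A minimal sketch of that difference, assuming the source file is saved as UTF-8; the names raw, text, byte_count and char_count are illustrative only:

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes of " ▲ ", written with escapes so that the literal is a
# byte string on both Python 2 and Python 3.
raw = b" \xe2\x96\xb2 "

# Decoding yields a text string (unicode on Python 2, str on Python 3).
text = raw.decode("utf-8")

byte_count = len(raw)    # length in UTF-8 bytes
char_count = len(text)   # length in code points

print("bytes: %d, code points: %d" % (byte_count, char_count))
# prints "bytes: 5, code points: 3"
```

The triangle (U+25B2) takes three bytes in UTF-8 and each space takes one, which accounts for the 5 reported by Python 2.7.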

Great, I get two different string sizes. Hopefully one of the wrappers I found around the net will help?

For tstr = text_type(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = text_type(" ▲ ")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
$ python3.2 test.py 
 ▲ 
3

For tstr = u(" ▲ "):

$ python2.7 test.py 
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = u(" ▲ ")
  File "test.py", line 11, in u
    return unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
$ python3.2 test.py 
 ▲ 
3

For tstr = b(" ▲ "):

$ python2.7 test.py 
 ▲ 
5
$ python3.2 test.py 
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = b(" ▲ ")
  File "test.py", line 17, in b
    return codecs.latin_1_encode(x)[0]
UnicodeEncodeError: 'latin-1' codec can't encode character '\u25b2' in position 1: ordinal not in range(256)

For tstr = binary_type(" ▲ "):

$ python2.7 test.py 
 ▲ 
5
$ python3.2 test.py 
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    tstr = binary_type(" ▲ ")
TypeError: string argument without an encoding

Well, that really makes things easy.

So, how do I get the same string length (in this case, 3) in both Python 2.7 and 3.2?

1 Answer

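OK, it turns out that unicode() in Python 2.7 accepts an encoding argument, and that apparently helps. Without it, the byte string is decoded with the default 'ascii' codec, which is what raised the UnicodeDecodeError: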

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
if sys.version_info[0] < 3:
  text_type = unicode
  binary_type = str
  def b(x):
    return x
  def u(x):
    return unicode(x, "utf-8")
else:
  text_type = str
  binary_type = bytes
  import codecs
  def b(x):
    return codecs.latin_1_encode(x)[0]
  def u(x):
    return x

tstr = u(" ▲ ")

sys.stderr.write(tstr)
sys.stderr.write("\n")
sys.stderr.write(str(len(tstr)))
sys.stderr.write("\n")

Running this, I get what I needed:
$ python2.7 test.py 
 ▲ 
3
$ python3.2 test.py 
 ▲ 
3
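
As a variation on the u() wrapper, here is a minimal sketch of a helper that accepts either bytes or text and always returns text. The name ensure_text and its encoding default are assumptions made for illustration, not part of the original script; the sample is written as an escaped byte literal so it means the same thing on both versions:

```python
# -*- coding: utf-8 -*-
import sys

# Text type on each major version: unicode on Python 2, str on Python 3.
if sys.version_info[0] < 3:
    text_type = unicode  # noqa: F821 (name exists only on Python 2)
else:
    text_type = str


def ensure_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as text, decoding bytes if needed."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


sample = b" \xe2\x96\xb2 "         # UTF-8 bytes for " ▲ "
print(len(ensure_text(sample)))    # prints 3
```

Unlike u(), which returns its argument unchanged on Python 3, this helper also decodes bytes objects there, so data read from files or sockets could go through the same path.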
