What is the correct way to convert to Unicode?

Question

Suppose you have a string:

s = "C:\Users\Eric\Desktop\beeline.txt"

And you want to convert it to Unicode if it is not already Unicode:

return s if PY3 or type(s) is unicode else unicode(s, "unicode_escape")

If there is a chance that the string contains \U (for example, a user directory), you will probably get a Unicode decode error:

UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 3-4: truncated \UXXXXXXXX escape
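
The same failure can be reproduced on Python 3 when the path is held as bytes. A minimal sketch, assuming the backslashes reach the decoder intact (the helper name `try_unicode_escape` is made up for illustration):

```python
# Hypothetical reproduction: the path as bytes, with literal backslashes.
path_bytes = b"C:\\Users\\Eric\\Desktop\\beeline.txt"


def try_unicode_escape(data):
    """Return the decoded text, or a short error description if decoding fails."""
    try:
        return data.decode("unicode_escape")
    except UnicodeDecodeError as exc:
        # "\U" must be followed by eight hex digits; "\Users" is not.
        return "error: " + exc.reason


result = try_unicode_escape(path_bytes)
print(result)
```

The decoder treats `\U` as the start of an eight-digit escape, so any Windows path that goes through a `Users` directory is affected.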

Is there anything wrong with just forcing it like this:

return s if PY3 or type(s) is unicode else unicode(s.encode('string-escape'), "unicode_escape")

Or should I explicitly check for the presence of \U, since it is the only special case?

I want the code to work in both Python 2 and Python 3.

- Alex

You probably want a raw string here: s = r"C:\Users\Eric\Desktop\beeline.txt" - georg

Be sure to define clearly how input such as s = r"C:\Users\Eric\Desktop\pr\U000000eat-\U000000e0-porter" should be handled. - Alfe
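
Both comments can be checked directly. A small sketch (the variable names are illustrative only, and only the tail of the second path is used so that `\Users` does not get in the way):

```python
# In a normal literal, "\b" is the escape for backspace (U+0008),
# so the file name silently loses its backslash.
plain = "Desktop\beeline.txt"
raw = r"Desktop\beeline.txt"
print(repr(plain))
print(repr(raw))

# The tail of the second path, as raw bytes. Its \UXXXXXXXX sequences are
# valid escapes, so unicode_escape decodes them instead of failing.
tricky = br"pr\U000000eat-\U000000e0-porter"
decoded = tricky.decode("unicode_escape")
print(repr(decoded))
```

A raw literal keeps every backslash, whereas decoding with unicode_escape can change a path that happens to contain well-formed escapes, with no error to signal it.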

3 Answers

- yuvi · Answer 1

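It works fine for English, but with real Unicode text the forced conversion may not use the encoding the bytes are actually in, which leads to unpleasant errors.

I wrapped the code you gave in a function called assert_unicode (replacing is with isinstance) and ran it on Hebrew text (it simply says "hello"). Take a look: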

In [1]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'unicode_escape')    

In [2]: assert_unicode(u'שלום')
Out[2]: u'\u05e9\u05dc\u05d5\u05dd'

In [3]: assert_unicode('שלום')
Out[3]: u'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

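See that? Both return a unicode object, but there is still a big difference. If you try to print or otherwise use the second result, it will probably fail (for me, a simple print failed, and I am using Console2, which is very Unicode-friendly).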
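
The difference comes from how unicode_escape treats bytes outside ASCII: it maps each byte to the code point with the same value, as Latin-1 does, so eight UTF-8 bytes become eight characters instead of four. A minimal sketch, assuming the console passed the Hebrew word as UTF-8 bytes (which the Out[3] value suggests):

```python
# UTF-8 bytes of the Hebrew word from the session above.
raw = b"\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"

wrong = raw.decode("unicode_escape")  # one character per byte
right = raw.decode("utf-8")           # one character per two bytes here
print(len(wrong), len(right))

# No information was lost, so the mangled text can be repaired by
# turning it back into bytes and decoding with the real encoding.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == right)
```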

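What is the fix? Use utf-8. It is the standard these days, and if you make sure everything is treated as utf-8, it should work like a charm for any given language: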

In [4]: def assert_unicode(s):
            return s if isinstance(s, unicode) else unicode(s, 'utf-8')    

In [5]: assert_unicode(u'שלום')
Out[5]: u'\u05e9\u05dc\u05d5\u05dd'

In [6]: assert_unicode('שלום')
Out[6]: u'\u05e9\u05dc\u05d5\u05dd'
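
Since the question asks for code that runs on both Python 2 and Python 3, here is one way the function could be adapted. This is only a sketch: `PY3` and `text_type` are defined locally as assumptions, and the input bytes are assumed to be UTF-8.

```python
import sys

PY3 = sys.version_info[0] >= 3
# `unicode` exists only on Python 2; the conditional expression keeps
# Python 3 from ever evaluating that name.
text_type = str if PY3 else unicode  # noqa: F821


def assert_unicode(s, encoding="utf-8"):
    """Return s as text, decoding byte strings with the given encoding."""
    if isinstance(s, text_type):
        return s
    return s.decode(encoding)


hello = u"\u05e9\u05dc\u05d5\u05dd"
print(assert_unicode(hello) == assert_unicode(hello.encode("utf-8")))
```

Calling `.decode` on the byte string avoids the `unicode()` built-in, which is what makes the same function body usable on both versions.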

- Paul · Answer 2

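The routine below is similar to @yuvi's answer, but it works through several encodings (configurable) and returns the encoding that was used. It also handles errors more gracefully (it only converts objects that are instances of basestring).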

# Unicode practice (Python 2): this routine forces string-like objects to unicode,
# preferring utf-8 but working through other encodings on error.
# Return values are the decoded string and the encoding used.
# Note: latin-1 can decode any byte sequence, so encodings listed after it are never tried.
def to_unicode_or_bust_multile_encodings(obj, encoding=('utf-8', 'latin-1', 'Windows-1252')):
  # only byte strings need decoding; unicode and non-string objects pass through
  if isinstance(obj, basestring) and not isinstance(obj, unicode):
    for elem in encoding:
      try:
        # if we succeed then exit early
        return unicode(obj, elem), elem
      except (UnicodeDecodeError, LookupError):
        # encoding did not work (or is unknown), try the next one
        pass

  return obj, 'no_encoding_found'
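
For readers on Python 3, where basestring and unicode do not exist, the same idea could look roughly like this. It is a sketch, not a drop-in replacement: the name `decode_with_fallback` and the default encoding order are assumptions, and latin-1 is placed last because it accepts any byte sequence.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in order; return (text, encoding_used).

    Objects that are not bytes are returned unchanged with None
    in place of the encoding name.
    """
    if not isinstance(data, bytes):
        return data, None
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this encoding did not fit, try the next one
    return data, None


print(decode_with_fallback(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"caf\xe9"))      # not valid UTF-8
```

A successful decode only means the bytes are valid in that encoding, not that the guess is right, so the returned encoding name is worth logging.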

- jfs · Answer 3

What is the correct way to convert to Unicode?

Here it is:

unicode_string = bytes_object.decode(character_encoding)

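Now the question becomes: I have a sequence of bytes; which character encoding should I use to convert them to a Unicode string?

The answer depends on where the bytes come from.

In your case, the bytestring is specified with a Python bytestring literal (Python 2), so the encoding is the character encoding of your Python source file. If there is no character encoding declaration at the top of the file (a comment that looks like # -*- coding: utf-8 -*-), the default source encoding is 'ascii' on Python 2 ('utf-8' on Python 3). So in your case, the answer is: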

if isinstance(s, str) and not PY3:
   return s.decode('ascii')

Or you could use Unicode literals directly (works on Python 2 and Python 3.3+):

unicode_string = u"C:\\Users\\Eric\\Desktop\\beeline.txt"
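
The `if isinstance(...)` fragment above uses `return`, so it is meant to sit inside a function. One way to wrap it, as a sketch (the name `to_text` and the local `PY3` flag are assumptions, not part of the original answer):

```python
import sys

PY3 = sys.version_info[0] >= 3


def to_text(s):
    """Decode Python 2 byte strings as ASCII; return anything else unchanged."""
    if isinstance(s, str) and not PY3:
        return s.decode('ascii')
    return s


path = to_text("C:\\Users\\Eric\\Desktop\\beeline.txt")
print(path)

# ASCII is strict: non-ASCII bytes raise an error instead of being guessed at.
try:
    b"caf\xc3\xa9".decode('ascii')
    strict = False
except UnicodeDecodeError:
    strict = True
print(strict)
```

Failing on non-ASCII input is intentional here: it points to a missing encoding declaration rather than hiding it behind mis-decoded text.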