Python Unicode Encode Error

Question

116

I'm reading and parsing an Amazon XML file. The file shows a ’ (a curly apostrophe), and when I try to print it I get the following error:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

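From what I've read online so far, the error comes from the XML file being UTF-8 encoded while Python wants to handle it as ASCII. Is there a simple way to make the error go away and have my program print the XML as it reads it?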
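A minimal sketch of the failure, written for Python 3 and using a shortened version of the string quoted later in the thread. It assumes nothing about the Amazon XML itself; it only shows that the text is valid and that ASCII, the target encoding, cannot represent U+2019.

```python
# The text itself is fine; only the target encoding (ASCII) lacks U+2019.
text = "Dorf and Svoboda\u2019s text"

error_position = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    error_position = exc.start
    print(exc.reason, "at position", exc.start, repr(text[exc.start]))
```

The reported position matches the "position 16" in the error message above.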

- Alex B

I was just coming to SO to post this question. Is there an easy way to sanitize a string for unicode()? - Nick Heiner

Please also see this answer to a related question: "Python UnicodeDecodeError - Am I misunderstanding encode?" - tzot

9 Answers

17

A better solution:

if isinstance(value, str):  # Python 2: str is a byte string
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you want to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1

- Paxwell

3

This won't solve the OP's problem: "can't encode character u'\u2019'". u'\u2019' is already Unicode. - jfs

8

Don't hardcode the character encoding of your environment inside your script; print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you can use the PYTHONIOENCODING environment variable to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

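Otherwise, python your_script.py should work as is: your locale settings are used to encode the text (on POSIX, check the LC_ALL, LC_CTYPE and LANG environment variables, and set LANG to a UTF-8 locale if necessary).
To print Unicode on Windows, see this answer, which shows how to print Unicode to the Windows console, to a file, or using IDLE.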
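To see which encoding Python has picked up from the environment, something like the following Python 3 sketch can help. The function name describe_output_encoding is made up for illustration; sys.stdout.encoding and locale.getpreferredencoding are standard library.

```python
import locale
import sys


def describe_output_encoding():
    """Return the encodings Python would use when printing text."""
    return {
        # May be None if sys.stdout has been replaced by a custom object.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LC_ALL, LC_CTYPE, LANG).
        "locale": locale.getpreferredencoding(False),
    }


print(describe_output_encoding())
```

If "stdout" reports ASCII (or an ANSI code page on Windows), printing u'\u2019' is expected to fail until the locale or PYTHONIOENCODING is changed.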

- jfs

2

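An excellent post on the subject: http://www.carlosble.com/2010/12/understanding-python-and-unicode/

It explains how Python and Unicode relate to each other and is worth reading closely. The helpers below (Python 2) convert between byte strings and Unicode strings: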

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode
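For readers on Python 3, where unicode no longer exists, a rough equivalent of these two helpers might look like the sketch below. The names to_text and to_bytes are illustrative and not part of the original answer.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes and dropping undecodable bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="ignore")
    # Numbers and other objects fall back to their str() representation.
    return str(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text with the given encoding."""
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


print(to_text(b"caf\xc3\xa9"))   # decoded text
print(to_bytes("\u2019"))        # UTF-8 bytes of the curly apostrophe
```

Note that errors="ignore" silently discards bytes that are not valid in the chosen encoding, which hides data problems rather than fixing them.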

- Ranvijay Sachan

0

You can use something of the form

s.decode('utf-8')

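This converts a UTF-8 encoded byte string into a Python Unicode string. The exact procedure depends on how you load and parse the XML file; for example, if you never access the XML string directly, you might have to use a decoder object from the codecs module.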
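As a sketch of the decoder-object approach (Python 3, with a made-up helper name decode_chunks), an incremental decoder from the codecs module can decode data that arrives in pieces, even when a multi-byte character is split across two chunks:

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks into a single str."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if bytes are left incomplete.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "Svoboda\u2019s".encode("utf-8")
# Split inside the three-byte UTF-8 sequence for U+2019.
chunks = [data[:8], data[8:]]
print(decode_chunks(chunks))
```

Calling data[:8].decode('utf-8') directly would fail here, because the first chunk ends in the middle of a character.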

- David Z

It's already encoded in UTF-8. The error is specifically:

myStrings = deque([u'Dorf and Svoboda\u2019s text builds on the str... and Computer Engineering\u2019s subdisciplines.'])

As you can see, the string is already in UTF-8, but it gets mad about the '\u2019' inside it. - Alex B

Oh OK, I thought you were having a different problem. - David Z

7

@Alex B: No, the string is Unicode, not UTF-8. To encode it as UTF-8, use '...'.encode('utf-8'). - sth
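The distinction in this comment can be illustrated with a short Python 3 sketch: a Unicode string holds code points, and UTF-8 is one way of turning those code points into bytes.

```python
text = "\u2019"                      # one Unicode code point
utf8_bytes = text.encode("utf-8")    # its UTF-8 representation as bytes

print(len(text), len(utf8_bytes))    # code points vs. bytes
print(utf8_bytes)
print(utf8_bytes.decode("utf-8") == text)
```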

0

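If you need to print an approximate representation of the string to the screen, rather than dropping the characters that cannot be printed, try the unidecode package: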

https://pypi.python.org/pypi/Unidecode

An explanation can be found here:

https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

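For a given string u, this is better than u.encode('ascii', 'ignore'), and it can save you unnecessary headaches when exact characters don't matter but you still want the output to be human-readable.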

Wirawan

- Wirawan Purwanto

0

I wrote the following to fix the nuisance non-ASCII quotes and force conversion to something usable.

unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    # Python 2 code: str() succeeds only if inStr is pure ASCII.
    try:
        return str(inStr)
    except UnicodeEncodeError:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except UnicodeEncodeError:
            if i in unicodeToAsciiMap:
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                # Report the unmapped character so it can be added to the map.
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except UnicodeEncodeError:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
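On Python 3 the same idea can be sketched with str.translate. The names QUOTE_MAP and to_ascii are illustrative, and the mapping is an assumption: unlike the answer above, it maps both single curly quotes to a plain apostrophe and also covers the double curly quotes.

```python
# Map common "smart" quotes to plain ASCII quotes (illustrative choice).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def to_ascii(text, placeholder="_"):
    """Replace known quotes, then substitute any other non-ASCII character."""
    mapped = text.translate(QUOTE_MAP)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in mapped)


print(to_ascii("Svoboda\u2019s caf\u00e9"))
```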

- user5910

-1

Try adding the following line at the top of your Python script.

# _*_ coding:utf-8 _*_

- abnvanand

-2

Python 3.5, 2018

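If you don't know the encoding but the Unicode parser is having problems, you can open the file in Notepad++ and, in the top menu bar, select Encoding -> Convert to ANSI. Then you can write your Python code like this: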

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)
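An alternative that avoids converting the file by hand is to pass an explicit encoding together with an error handler to open(). The sketch below (Python 3) assumes the file is mostly UTF-8; the helper name read_words and the temporary sample file are made up for the demonstration.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Read whitespace-separated words, replacing undecodable bytes."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read().split()


# Build a small sample file: valid UTF-8 followed by one invalid byte.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"Svoboda\xe2\x80\x99s text \xff")

words = read_words(sample_path)
os.remove(sample_path)
print(words)
```

With errors="replace", each undecodable byte becomes U+FFFD instead of raising UnicodeDecodeError, so the damage stays visible in the output.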

- Atomar94


- Scott Stafford · Accepted Answer

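Most likely, your problem is that you parsed the XML fine, and now you're trying to print its contents but can't because it contains some non-ASCII Unicode characters. Try encoding your Unicode string as ASCII first: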

unicodeData.encode('ascii', 'ignore')

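The 'ignore' part tells it to simply skip those characters. From the Python docs (output shown for Python 3, where encode() returns bytes):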

>>> # Python 2: u = unichr(40960) + u'abcd' + unichr(1972)
>>> u = chr(40960) + u'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
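Building on the error handlers shown above, a small helper can make any text safe to print on a limited console. This is a Python 3 sketch; the name printable is made up, and falling back to ASCII when the output encoding is unknown is an assumption.

```python
import sys


def printable(text, encoding=None, errors="replace"):
    """Return text with characters the target encoding lacks substituted."""
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Round-trip through bytes so the result is str again, ready for print().
    return text.encode(encoding, errors).decode(encoding)


print(printable("Svoboda\u2019s", encoding="ascii"))
print(printable("Svoboda\u2019s", encoding="ascii", errors="backslashreplace"))
```

The 'backslashreplace' handler keeps the information about which character was there, which is often more useful for debugging than 'ignore'.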

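You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After reading it, you'll stop feeling like you're just guessing which commands to use (or at least that's what happened to me).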
