Python: UnicodeEncodeError: 'latin-1' codec can't encode character

Question

18

I call an API and, for each record it returns, make a database call. The API returns strings, and for some of the returned items the database call fails with the following error.

Traceback (most recent call last):
  File "TopLevelCategories.py", line 267, in <module>
    cursor.execute(categoryQuery, {'title': startCategory});
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
    query = query % db.literal(args)
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
    return self.escape(o, self.encoders)
  File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)

The code the error refers to is:

         ...    
         for startCategory in value[0]:
            categoryResults = []
            try:
                categoryRow = ""
                baseCategoryTree[startCategory] = []
                #print categoryQuery % {'title': startCategory}; 
                cursor.execute(categoryQuery, {'title': startCategory}) #unicode issue
                done = False
                cont...

After some googling, I tried the following on the command line to understand what is going on...

>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'

But I am not sure how to fix this. I also don't know the theory behind encode('cp1252'); an explanation of what I tried above would be great.

- add-semi-colons

1

Possible duplicate of: UnicodeEncodeError: 'latin-1' codec can't encode character - ivan_pozdeev

3 Answers

3

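The Unicode character u'\u2013' is the en dash. It is part of the Windows-1252 (cp1252) character set, where it is encoded as 0x96, but it is not part of Latin-1 (iso-8859-1). Windows-1252 defines additional printable characters in the 0x80-0x9F range, which Latin-1 leaves to control codes, and the en dash is one of them.

The solution is to pick a target character set other than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a plain "-".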
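
As a rough illustration of the point above, the following sketch checks which of the three codecs can represent the en dash. It is written in Python 3 syntax (in Python 2.7, as used in the question, the literal would need a u prefix), and the helper name can_encode is only an illustrative choice.

```python
# Python 3 syntax; under Python 2.7 the literal would be u'\u2013'.
EN_DASH = '\u2013'

def can_encode(text, codec):
    """Return True if `text` can be encoded with `codec` without an error."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Check the three character sets discussed in this answer.
results = {codec: can_encode(EN_DASH, codec)
           for codec in ('latin-1', 'cp1252', 'utf-8')}
print(results)

# cp1252 stores the en dash as the single byte 0x96.
print(EN_DASH.encode('cp1252'))
```

This also accounts for the odd line in the question's session: in Python 2, '\u2013' without the u prefix is a byte string in which \u is not an escape sequence, so it holds six plain ASCII characters, and encoding it gives '\\u2013' rather than '\x96'.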

- Cito

1

u.encode('utf-8') converts the string to bytes, which can then be printed to stdout with sys.stdout.buffer.write(bytes). See displayhook at https://docs.python.org/3/library/sys.html.
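
A minimal sketch of that idea, assuming Python 3 (sys.stdout.buffer does not exist in Python 2). The function name write_utf8 and its optional stream parameter are hypothetical, added here so the behaviour can be checked against an in-memory stream.

```python
import io
import sys

def write_utf8(text, stream=None):
    """Encode `text` as UTF-8 and write the bytes to a binary stream."""
    if stream is None:
        # Binary layer underneath sys.stdout (Python 3 only).
        stream = sys.stdout.buffer
    data = text.encode('utf-8')
    stream.write(data)
    return data

# Demonstrate with an in-memory binary stream instead of the real stdout.
buf = io.BytesIO()
write_utf8('hello\u2013world\n', buf)
print(repr(buf.getvalue()))
```

Calling write_utf8(text) with no stream argument sends the bytes straight to the terminal, bypassing whatever encoding sys.stdout would otherwise apply.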

- PriyankaP


- Raymond Hettinger · Accepted Answer

If you need Latin-1 output, you have several options for getting rid of code points above 255 (characters not in Latin-1), such as the en dash:

>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace')    # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore')     # ignore it
'helloworld'

Or do a custom replacement:

>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'
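
When more than one character needs a custom replacement, the same idea can be extended with a translation table. The sketch below assumes Python 3 syntax; the ASCII_FALLBACKS mapping and the to_latin1 helper are hypothetical names, and the set of characters in the table is only an example, not an exhaustive list.

```python
# Hypothetical fallback table: code point -> plain replacement text.
ASCII_FALLBACKS = {
    0x2013: '-',   # en dash
    0x2014: '-',   # em dash
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}

def to_latin1(text):
    """Apply the fallback table, then encode as Latin-1.

    Anything still outside Latin-1 after the substitutions becomes '?'.
    """
    return text.translate(ASCII_FALLBACKS).encode('latin-1', 'replace')

print(to_latin1('hello\u2013world'))
print(to_latin1('\u201cquoted\u201d'))
```

Characters that Latin-1 already covers, such as é, pass through unchanged, so the table only needs entries for the typographic characters expected in the input.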

If you are not required to output Latin-1, UTF-8 is a common and preferred choice. It is recommended by the W3C and encodes all Unicode code points nicely:

>>> u.encode('utf-8')
'hello\xe2\x80\x93world'