UnicodeDecodeError when using Django and format strings

Question

10

I wrote a small example using Python 2.7 and Django 1.10.8 so that everyone can see what goes wrong.

# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function

import time
from django import setup
setup()
from django.contrib.auth.models import Group

group = Group(name='schön')

print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

time.sleep(1.0)
print('%s' % group)
print('%r' % group)   # fails
print('%s' % [group]) # fails
print('%r' % [group]) # fails

It exits with the following output and traceback.

$ python .PyCharmCE2017.2/config/scratches/scratch.py
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
Traceback (most recent call last):
  File "/home/srkunze/.PyCharmCE2017.2/config/scratches/scratch.py", line 22, in <module>
    print('%r' % group) # fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Does anybody know what is really going on here?

- Sven R. Kunze

@DavidBern Done. - Sven R. Kunze

I have a feeling that the representation method in your Group class definition is doing something naughty :P - David Bern

1

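I had a hard time reproducing your error. However, I have run into the same problem myself. The cause was that an "ä" was being decoded in the __repr__ method. The method worked fine until the day I imported unicode_literals from __future__. The fix at the time was simply to remove the decoding and the use of __repr__, and to return unicode instead. - David Bern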

1

You are interpolating into a Unicode string, and that includes an implicit decode. Use b'...' bytestrings instead. - Martijn Pieters

1

@DavidBern: This is very simple to reproduce: u'%s' % '<Group: sch\xc3\xb6n>'. The problem here is that the OP used from __future__ import unicode_literals. - Martijn Pieters

5 Answers

3

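I had a hard time finding a general solution to your problem.

__repr__() is supposed to return a str, and every attempt to change that seems to cause new problems.

As for the fact that the __repr__() method is defined outside your project, you can override it. For example: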

def new_repr(self):
    return 'My representation of self {}'.format(self.name)

Group.add_to_class("__repr__", new_repr)

The only working solution I could find is to tell the interpreter explicitly how to handle the strings.

from __future__ import unicode_literals
from django.contrib.auth.models import Group

group = Group(name='schön')

print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

print('%s' % group)
print('%r' % repr(group))
print('%s' % [str(group)])
print('%r' % [repr(group)])

# added
print('{}'.format([repr(group).decode("utf-8")]))
print('{}'.format([repr(group)]))
print('{}'.format(group))

Handling strings in Python 2.x is a real pain. I hope this gives you some ideas for a workaround (it is the only approach I found).

- David Bern

1

I think the real problem is in the Django code. It was reported six years ago:

https://code.djangoproject.com/ticket/18063

I think a patch to Django would solve it:

def __repr__(self):
    return self.....encode('ascii', 'replace')

I think repr() should return "7-bit ASCII".
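
To make the idea concrete without patching Django, here is a minimal sketch on a plain class. Label is a hypothetical stand-in, not a Django model, and the `str is bytes` check is only there so the same code runs on Python 2 and Python 3. Note that the 'replace' error handler is lossy: every non-ASCII character becomes '?'.

```python
class Label(object):
    """Hypothetical stand-in for a model with a non-ASCII name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        text = u'<Label: %s>' % self.name
        # Force 7-bit ASCII; characters outside ASCII become '?'.
        ascii_bytes = text.encode('ascii', 'replace')
        # Python 2: str is bytes, so return the bytes unchanged.
        # Python 3: __repr__ must return text, so decode the ASCII bytes.
        return ascii_bytes if str is bytes else ascii_bytes.decode('ascii')


label = Label(u'sch\xf6n')
# An ASCII-only repr can be interpolated into a Unicode string safely.
message = u'%r' % label
```

With such a repr, the implicit ASCII decode in u'%r' % label cannot fail, at the cost of losing the original character.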

- guettli

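@TechJS The ..... is a placeholder that is not important in this context. - guettli

Not sure Django is at fault here. Just as with __str__ in Python 2, there is no requirement to return ASCII-safe data. The core Python types do so, certainly, but that is not a stated requirement. - Martijn Pieters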

1

@GhostlyMartijn Yes, you are right. I checked the docs. There is no official requirement, but I consider it "best practice", or a way to "avoid confusion". Docs: https://docs.python.org/2/reference/datamodel.html#object.__repr__ - guettli

1

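Well, the cause here is still that the OP is mixing bytestrings and Unicode strings. Things happen to work when the strings are ASCII-safe, but you should not mix them either way. Python 3 prevents this entirely. - Martijn Pieters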

-1

If that is the case, then we need to override the unicode method with a custom one. Try the code below; it will work. I have tested it.

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from django.contrib.auth.models import Group

def custom_unicode(self):
    return u"%s" % (self.name.encode('utf-8', 'ignore'))
Group.__unicode__ = custom_unicode

group = Group(name='schön')

# Tests
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

print('%s' % group)
print('%r' % group)  
print('%s' % [group])
print('%r' % [group])

# output:
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
<Group: schön>
[<Group: schön>]
[<Group: schön>]

Reference: https://docs.python.org/2/howto/unicode.html

- anjaneyulubatta505

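This does not solve the problem. The model will still have non-ASCII bytes in __repr__. But now your __unicode__ result also contains UTF-8 bytes that should not be there. And because you changed the default encoding, you end up with exactly the same result, so the whole exercise is pointless. - Martijn Pieters

The only thing you did that makes it work is the setdefaultencoding() call, which is like strapping a stick to your broken leg. It is the wrong solution. You should fix the break, which means not mixing bytestrings and Unicode text in the first place. - Martijn Pieters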
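
The round trip described in these two comments can be sketched without Django. interpolate() below is a hypothetical stand-in that spells out, as an explicit decode, what Python 2 does implicitly for u"%s" % some_bytestring; the default_encoding parameter plays the role of sys.getdefaultencoding().

```python
name = u'sch\xf6n'

# What custom_unicode() builds first: UTF-8 bytes, not text.
encoded = name.encode('utf-8', 'ignore')


def interpolate(data, default_encoding):
    """Hypothetical stand-in for Python 2's u"%s" % data on a bytestring."""
    return u"%s" % data.decode(default_encoding)


# With the default encoding switched to UTF-8, the bytes decode straight
# back to the original text, so the encode step achieved nothing.
round_tripped = interpolate(encoded, 'utf-8')

# With the stock default ('ascii'), the same interpolation fails.
try:
    interpolate(encoded, 'ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Under these assumptions, the only change that matters is the default encoding; the custom __unicode__ just encodes and immediately decodes again.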

-1

I am not familiar with Django. Your problem seems to be that text data which is actually Unicode is being represented as ASCII. Try the unidecode module in Python.

from unidecode import unidecode
#print(string) is replaced with 
print(unidecode(string))

Reference: Unidecode

- Sreeragh A R

1

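This is not needed. That module is great when your encoding target is limited to ASCII, but that is not the case here. They already have encoded bytes; the problem is the implicit decode back to Unicode. The module does not help with that. - Martijn Pieters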

- Martijn Pieters · Accepted Answer

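The problem here is that you are interpolating a UTF-8 bytestring into a Unicode string. Your '%r' string is a Unicode string because you used from __future__ import unicode_literals, but repr(group) (which the %r placeholder uses) returns a bytestring. For Django models, repr() can include Unicode data in the representation, encoded to a bytestring using UTF-8. Such representations are not ASCII-safe.

For your specific example, repr() on your Group instance produces the bytestring '<Group: sch\xc3\xb6n>'. Interpolating that into a Unicode string triggers the implicit decode: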

>>> u'%s' % '<Group: sch\xc3\xb6n>'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Note that I did not use from __future__ import unicode_literals in my Python session, so the '<Group: sch\xc3\xb6n>' string literal is not a unicode object but a str bytestring object!
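
The implicit step can be made visible by writing the decode out by hand. This is only a sketch: it assumes the interpreter's default encoding is still 'ascii' (the Python 2 default), and it uses explicit b'' and u'' prefixes so that it behaves the same with or without unicode_literals.

```python
# UTF-8 encoded bytes, as repr(group) returns them on Python 2.
raw = b'<Group: sch\xc3\xb6n>'

# Interpolating a bytestring into a unicode string makes Python 2 do the
# equivalent of raw.decode('ascii'), which cannot handle byte 0xc3.
try:
    raw.decode('ascii')
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = exc

# Decoding explicitly with the encoding the bytes actually use avoids
# the implicit step; the interpolation then only ever sees text.
text = raw.decode('utf-8')
message = u'%s' % text
```

The error position matches the traceback in the question: the byte 0xc3 sits at index 11 of the representation.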

In Python 2, you should avoid mixing Unicode and bytestrings. Always normalize your data explicitly (encode Unicode to bytes, or decode bytes to Unicode).
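
One way to normalize at a single choke point is a small helper that decodes bytestrings and passes text through unchanged. The names to_text and text_repr are hypothetical, and UTF-8 is an assumption about how the bytes were encoded; this is a sketch, not a drop-in replacement for anything in Django.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding bytestrings explicitly.

    On Python 2, bytes is an alias for str, so this catches the
    bytestrings that repr() and str() return there.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def text_repr(obj, encoding='utf-8'):
    """repr(obj) as text, regardless of what type repr() returned."""
    return to_text(repr(obj), encoding)


decoded = to_text(b'<Group: sch\xc3\xb6n>')
unchanged = to_text(u'sch\xf6n')
```

With a helper like this, the failing line from the question could be written as print('%s' % text_repr(group)), so the Unicode format string never receives a bytestring.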

If you must use from __future__ import unicode_literals, you can still create bytestrings by using the b prefix:

>>> from __future__ import unicode_literals
>>> type('')   # empty unicode string
<type 'unicode'>
>>> type(b'')  # empty bytestring, note the b prefix
<type 'str'>
>>> b'%s' % b'<Group: sch\xc3\xb6n>'  # two bytestrings
'<Group: sch\xc3\xb6n>'