Python 3 UTF-8 decoding problem

Question

4

The following code runs fine with Python 3 on my Windows machine and prints the character 'é':

data = b"\xc3\xa9"

print(data.decode('utf-8'))

However, running the same thing in an Ubuntu-based Docker container fails with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

What software do I need to install to enable UTF-8 decoding?

- user3923073

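Decoding the given string as 'utf-8' should work regardless. I only get the error you quote when I explicitly specify 'ascii' as the codec. Your error also indicates that ASCII encoding is being used. For years I have not known any Linux to use a default encoding other than UTF-8... - planetmaker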

2

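@planetmaker: Some "minimal" Linux setups default to LANG=C, and in that case it is print that fails, not decode. Explicitly changing to LANG=en_US.utf-8 in the relevant shell init file (and logging out and back in so the locale is set correctly everywhere) should fix it. - ShadowRanger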

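@ShadowRanger, not on Ubuntu Xenial, at least. I have used the lv_LV.Utf-8 locale from the start, yet Python defaults to ASCII. I only noticed this recently when trying to enter Unicode in the CLI. In files, I always specify the encoding with a comment. - Gnudiff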

2 Answers

4

The problem is with the print() expression, not with the decode() method. If you look closely, the exception raised is a UnicodeEncodeError, not a -DecodeError.
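The distinction can be checked without a terminal at all. The sketch below separates the two steps; the explicit `.encode("ascii")` call is an assumption standing in for what print() does implicitly when the output stream's encoding is ASCII.

```python
data = b"\xc3\xa9"

# Step 1: decoding the UTF-8 bytes succeeds in any environment.
text = data.decode("utf-8")

# Step 2: encoding the resulting str with the ASCII codec fails,
# which is what print() runs into when stdout is ASCII-only.
try:
    text.encode("ascii")
    error_name = None
    message = ""
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    message = str(exc)

# ascii() keeps this line printable even on an ASCII-only stdout.
print(ascii(text), error_name)
```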

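Whenever you use the print() function, Python converts its arguments to str, then encodes the result to bytes, which are sent to the terminal (or wherever Python is running). The codec used for encoding (e.g. UTF-8 or ASCII) depends on the environment. In an ideal case,

the codec used by Python is compatible with the one the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers the range of characters you need (such as UTF-8 or UTF-16, which contain all characters).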
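To see which codec Python actually picked in a given environment, it may help to inspect the output stream and the locale directly. This is a minimal diagnostic sketch; the variable names are illustrative only.

```python
import locale
import sys

# Encoding Python chose for text written via print().
stdout_encoding = sys.stdout.encoding

# Encoding derived from the locale settings (LANG, LC_ALL, ...).
locale_encoding = locale.getpreferredencoding(False)

print("stdout encoding:", stdout_encoding)
print("locale encoding:", locale_encoding)
```

In a container that shows the error from the question, one would expect the first line to report an ASCII codec (Python names it "ANSI_X3.4-1968" on some systems).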

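In your case, the second condition isn't met for the Linux docker: the encoding used is ASCII, which only supports characters found on an old English typewriter. Here are a few options to address this problem: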

Set environment variables: on Linux, Python's encoding defaults depend on them (at least partially). In my experience, this is a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script for the shell your terminal runs, e.g. .bashrc.
Re-encode STDOUT, like so:
```
import sys
sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
```
The encoding used has to match the one of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match the one of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output, the logging module for progress info – depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when writing to STDERR with the logging module).
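The second and third options can be tried out against an in-memory byte buffer instead of a real terminal. In this sketch, `io.BytesIO` stands in for `sys.stdout.buffer` (an assumption made for illustration), and names such as `ascii_out` and `utf8_out` are hypothetical.

```python
import io

# In-memory stand-ins for sys.stdout.buffer (hypothetical, for illustration).
ascii_raw = io.BytesIO()
utf8_raw = io.BytesIO()

# A text layer with an ASCII codec behaves like the failing Docker setup.
ascii_out = io.TextIOWrapper(ascii_raw, encoding="ascii")
try:
    print("\u00e9", file=ascii_out)
    ascii_out.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Wrapping the byte stream with UTF-8 lets the same print() succeed.
utf8_out = io.TextIOWrapper(utf8_raw, encoding="utf-8")
print("\u00e9", file=utf8_out)
utf8_out.flush()

# Writing pre-encoded bytes straight to a byte stream also works.
manual_raw = io.BytesIO()
manual_raw.write("\u00e9".encode("utf-8"))
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is a shorter way to change the encoding of the real stream; it is not part of the answer above and is mentioned here only as an alternative.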

There might be other options, but I doubt that there are nicer ones.

- lenz

- planetmaker · Accepted Answer

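It seems that Ubuntu, depending on the version, uses one encoding or another as the default, and it may also differ between the shell and Python. Adopted from this post and also this blog:

Thus the recommended way seems to be to tell your Python instance to use utf-8 as the default encoding:

Set the encoding Python uses for standard input, output and error via an environment variable: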

export PYTHONIOENCODING=utf8
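The effect of this variable can be observed by launching a child interpreter with different values. The sketch below is only an illustration: `run_child` is a hypothetical helper, and the ASCII run is there to reproduce the error from the question.

```python
import os
import subprocess
import sys


def run_child(io_encoding):
    """Run a child interpreter that prints 'é' under the given PYTHONIOENCODING."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        [sys.executable, "-c", "print('\\xe9')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


# With an ASCII stream, print() in the child raises UnicodeEncodeError.
ascii_run = run_child("ascii")

# With UTF-8, the child writes the two bytes from the question.
utf8_run = run_child("utf-8")

print("ascii exit code:", ascii_run.returncode)
print("utf-8 output bytes:", utf8_run.stdout)
```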

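In addition, you can declare the encoding explicitly in your source files (see this question + answer, the Python docs and PEP 263). Note that this declaration only controls how the source file itself is decoded, not how print() encodes its output, and Python 3 already assumes UTF-8 for source files by default.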

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....

As for how Python interprets the encoding of files it reads, you can specify it explicitly in the open command:

with open(fname, "rt", encoding="utf-8") as f:
    ...
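As a small round-trip check of the explicit `encoding` argument, the following sketch writes the character from the question to a scratch file and reads it back. The temporary file is created only for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
fd, fname = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Write the character with an explicit encoding, independent of the locale.
    with open(fname, "wt", encoding="utf-8") as f:
        f.write("\u00e9")

    # The file now holds the two UTF-8 bytes from the question.
    with open(fname, "rb") as f:
        raw = f.read()

    # Reading with the matching encoding restores the original character.
    with open(fname, "rt", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(fname)
```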

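There is also a more hackish way, with some side effects, that saves you from specifying it explicitly every time. Note that it only applies to Python 2: in Python 3, reload() is not a builtin and sys.setdefaultencoding() no longer exists.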

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

Please read the warnings about this hack in the related answer and its comments.