Setting the encoding in a Python 3 CGI script

Question

23

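While writing a Python 3.1 CGI script, I ran into the dreaded UnicodeDecodeErrors. However, everything works fine when I run the script from the command line.

It seems that open() and print() use the return value of locale.getpreferredencoding() to decide which encoding to use by default. When the script runs from the command line this value is 'UTF-8', as it should be. But when it runs through a browser, the encoding is mysteriously redefined as 'ANSI_X3.4-1968', which appears to be just a fancy name for plain ASCII.

I now need to know how to make the CGI script run with 'utf-8' as the default encoding in all cases. My setup is Python 3.1.3 and Apache2 on Debian Linux. The system-wide locale is en_GB.utf-8.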
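One way to confirm this diagnosis is to have the script report the encodings it actually sees, once from the shell and once through the web server. The following is only a minimal sketch; `encoding_report` is a hypothetical helper name, not part of the standard library.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that the question identifies as the defaults."""
    return {
        # Locale-derived value that open() falls back on when no
        # encoding argument is passed.
        "preferred": locale.getpreferredencoding(),
        # Encoding attached to the stream that print() writes to.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(name, "=", value)
```

If the two runs print different values, the difference comes from the environment the process was started in rather than from the script itself.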

- jforberg

7 Answers

5

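For CGI/WSGI you should not read the IO streams as strings; they are not Unicode strings, they are explicitly byte sequences.

(Consider that Content-Length is measured in bytes, not characters; imagine trying to read a multipart/form-data binary file upload that has been crunched into UTF-8-decoded strings, or trying to return a binary file download...)

So use sys.stdin.buffer and sys.stdout.buffer to get the raw byte streams for stdio, and do binary reads and writes with them. It is up to the form-reading layer to convert those bytes into Unicode string parameters using the appropriate encoding.

Unfortunately, the standard library CGI and WSGI interfaces don't get this right in Python 3.1: the relevant modules were converted from Python 2 with 2to3, and as a result contain a number of bugs that end in a UnicodeError.

The first Python 3 version that is usable for web applications is 3.2. Using 3.0/3.1 is pretty much a waste of time. It took a lamentably long time to get this sorted out and PEP 3333 passed.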
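As an illustration of the byte-oriented approach, here is a minimal sketch that builds a whole CGI response as bytes and writes it through sys.stdout.buffer. `build_response` is a hypothetical helper name, and the choice of UTF-8 and text/html is an assumption for the example.

```python
import sys


def build_response(body_text, charset="utf-8"):
    """Return a complete CGI response (headers plus body) as bytes."""
    body = body_text.encode(charset)
    headers = (
        "Content-Type: text/html; charset={}\r\n"
        "Content-Length: {}\r\n"
        "\r\n"
    ).format(charset, len(body))  # length counts bytes, not characters
    return headers.encode("ascii") + body


if __name__ == "__main__":
    # The binary layer never consults the locale, so this behaves the
    # same from the shell and under the web server.
    sys.stdout.buffer.write(build_response("<p>h€lló wörld</p>"))
    sys.stdout.buffer.flush()
```

Note that "h€lló" is 5 characters but 8 bytes in UTF-8, which is why Content-Length has to be computed after encoding.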

- bobince

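I agree. Forcing ASCII mode seems like very poor behaviour for a package, now that all text and files are supposed to be Unicode by default. Debian (stable) does not have Python 3.2 yet, so for now I'm pretty much stuck with what 3.1 offers. - jforberg

Revisiting this nearly 10 years later, this is the correct answer. Apache HTTPD does no encoding or decoding; strictly speaking it is the Python layer that does it. The input and output of the data have nothing to do with the host: the source data comes from the client, and the result is sent back to the client. - Bretton Wade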

4

I solved my problem with the following code:

import locale                                  # Ensures that subsequent open()s 
locale.getpreferredencoding = lambda: 'UTF-8'  # are UTF-8 encoded.

import sys                                     
sys.stdin = open('/dev/stdin', 'r')       # Re-open standard files in UTF-8 
sys.stdout = open('/dev/stdout', 'w')     # mode.
sys.stderr = open('/dev/stderr', 'w')

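This solution isn't perfect, but it seems to work for the time being. I chose Python 3 over the more widespread v2.6 as my development platform because of its advertised good Unicode handling, but the cgi package seems to ruin some of that simplicity.

As far as I know, the /dev/std* files may not exist on older systems that lack procfs. They are, however, supported on recent Linux versions.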
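For ordinary files, an alternative that needs neither the monkey-patch nor the /dev/std* files is to pass the encoding to open() explicitly. A minimal sketch, assuming UTF-8 files; `write_text_utf8` and `read_text_utf8` are hypothetical helper names:

```python
import os
import tempfile


def write_text_utf8(path, text):
    # Explicit encoding: the locale default is never consulted.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    write_text_utf8(path, "h€lló wörld\n")
    # ascii() keeps the output printable even if stdout is ASCII-only.
    print(ascii(read_text_utf8(path)))
```

This only covers files the script opens itself; the standard streams still need one of the other approaches in this thread.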

- jforberg

I tried @cercatrova's answer (editing /etc/apache2/envvars), but unfortunately that didn't work. @jforberg's solution did work, although I had to change UTF-8 to latin-1. - ErikusMaximus

3

Summarizing @cercatrova's answer:

Add the line PassEnv LANG to the end of /etc/apache2/apache2.conf or your .htaccess file.
Uncomment the line . /etc/default/locale in /etc/apache2/envvars.
Make sure a line similar to LANG="en_US.UTF-8" is present in /etc/default/locale.
sudo service apache2 restart

- Klesun

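This solution resolved a problem that had dragged on for days: how to run cskit via a Symfony process (or PHP exec) under Apache2 and Nginx. Thank you! - cherrysoft

Thanks, this helped me a lot. I had been struggling for two days. - Lingjing France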

2

Short answer: as described in "mod_cgi + utf8 + Python3 produces no output", you only need to add the following to your .htaccess file:

SetEnv PYTHONIOENCODING utf8

along with:

Options +ExecCGI
AddHandler cgi-script .py
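The effect of the variable can be reproduced locally without touching Apache. The following is a rough sketch, assuming a child Python interpreter stands in for the CGI process; `child_stdout_encoding` is a made-up helper name.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and
    return the canonical name of its sys.stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # Normalise spellings such as 'utf8' or 'UTF-8' to one codec name.
    return codecs.lookup(out.decode("ascii").strip()).name


if __name__ == "__main__":
    print(child_stdout_encoding("utf8"))
```

Because the interpreter reads PYTHONIOENCODING at startup, it has to be set in the environment of the process (here via `env`, under Apache via SetEnv), not from inside the script.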

- Basj

1

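Your best bet is to explicitly encode your Unicode strings into bytes using the encoding you want. Relying on the implicit conversion leads to problems like this one.

By the way: if the error really is a UnicodeDecodeError, then it isn't happening on output. It happens when a byte stream is being decoded into Unicode, which takes place somewhere else.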
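The distinction between the two error types can be shown in a few lines. This is just an illustrative sketch; `encode_or_error` and `decode_or_error` are hypothetical helpers that return the exception's class name instead of raising.

```python
def encode_or_error(text, encoding):
    """str -> bytes; on failure return the exception class name."""
    try:
        return text.encode(encoding)
    except UnicodeError as exc:
        return type(exc).__name__


def decode_or_error(data, encoding):
    """bytes -> str; on failure return the exception class name."""
    try:
        return data.decode(encoding)
    except UnicodeError as exc:
        return type(exc).__name__


if __name__ == "__main__":
    euro_bytes = "€".encode("utf-8")            # explicit, always works
    print(euro_bytes)
    print(encode_or_error("€", "ascii"))        # output-side failure
    print(decode_or_error(euro_bytes, "ascii")) # input-side failure
```

Encoding failures belong to the writing side (print, file writes), decoding failures to the reading side (stdin, file reads), which helps locate where the ASCII default is biting.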

- Ned Batchelder

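The script involves both file input and output, so I get decoding errors as well as encoding errors. Because the cgi package forces ASCII mode, my Unicode-encoded files cannot be read properly. - jforberg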

0

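I ran into the same problem. My environment is Windows 10 + Apache 2.4 + Python 3.8.
I am developing an overlay for Google Earth Pro, which only accepts CGI for dynamic content.
The top answer explains the cause, but its method didn't work for me.
My solution is: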

import codecs, sys
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)

It works well.
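To see what this kind of wrapping does, it can be tried on an in-memory binary stream instead of the real stdout. A minimal sketch, with `wrap_with_codecs` and `wrap_with_textio` as hypothetical names; the io.TextIOWrapper variant is an alternative I am adding for comparison, not something from the answer.

```python
import codecs
import io


def wrap_with_codecs(raw):
    """The codecs.getwriter pattern, applied to any binary stream."""
    return codecs.getwriter("utf8")(raw)


def wrap_with_textio(raw):
    """Alternative wrapper from the io module (assumed equivalent
    for plain writes; it buffers, so flush before inspecting raw)."""
    return io.TextIOWrapper(raw, encoding="utf-8", newline="\n")


if __name__ == "__main__":
    for wrap in (wrap_with_codecs, wrap_with_textio):
        raw = io.BytesIO()       # stands in for sys.stdout.buffer
        out = wrap(raw)
        out.write("h€lló\n")
        out.flush()
        print(raw.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') is another way to get the same result without replacing the stream object.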

- Ryan Tu

- cercatrova · Accepted Answer

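Answering this for late-comers, because I don't think the earlier answers get to the root of the problem, which is the lack of locale environment variables in a CGI context. I'm using Python 3.2.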

open() opens file objects in text (string) or binary (bytes) mode for reading and/or writing. In text mode, the encoding used to encode strings written to the file, and to decode bytes read from it, may be specified in the call. If it isn't, it is determined by locale.getpreferredencoding(), which on Linux uses the encoding from your locale environment settings, normally UTF-8 (from e.g. LANG=en_US.UTF-8):
```
>>> f = open('foo', 'w')         # open file for writing in text mode
>>> f.encoding
'UTF-8'                          # encoding is from the environment
>>> f.write('€')                 # write a Unicode string
1
>>> f.close()
>>> exit()
user@host:~$ hd foo
00000000  e2 82 ac      |...|    # data is UTF-8 encoded
```
sys.stdout is in fact a file opened for writing in text mode, with an encoding based on locale.getpreferredencoding(). You can write strings to it just fine and they will be encoded to bytes using sys.stdout's encoding. print() writes to sys.stdout by default; print() itself has no encoding, rather it is the file it writes to that has one:
```
>>> sys.stdout.encoding
'UTF-8'                          # encoding is from the environment
>>> exit()
user@host:~$ python3 -c 'print("€")' > foo
user@host:~$ hd foo
00000000  e2 82 ac 0a   |....|   # data is UTF-8 encoded; \n is from print()
```
You cannot write bytes to sys.stdout; use sys.stdout.buffer.write() for that. If you try to write bytes with sys.stdout.write(), it raises a TypeError, and if you use print(), it simply converts the bytes object to its string representation, so an escape sequence like \xff is written as the four characters \, x, f, f:
```
user@host:~$ python3 -c 'print(b"\xe2\xf82\xac")' > foo
user@host:~$ hd foo
00000000  62 27 5c 78 65 32 5c 78  66 38 32 5c 78 61 63 27  |b'\xe2\xf82\xac'|
00000010  0a                                                |.|
```
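The text layer versus binary layer behaviour can also be checked without a terminal. This is a small sketch using in-memory streams as stand-ins for sys.stdout and sys.stdout.buffer; `try_text_write` is a hypothetical helper.

```python
import io


def try_text_write(stream, payload):
    """Attempt stream.write(payload); return 'ok' or the error class name."""
    try:
        stream.write(payload)
        return "ok"
    except TypeError as exc:
        return type(exc).__name__


if __name__ == "__main__":
    raw = io.BytesIO()                              # like sys.stdout.buffer
    text = io.TextIOWrapper(raw, encoding="utf-8")  # like sys.stdout
    print(try_text_write(text, b"\xe2\x82\xac"))    # bytes into the text layer
    print(try_text_write(raw, b"\xe2\x82\xac"))     # bytes into the binary layer
    print(str(b"\xff"))                             # the conversion print() applies
```

The text layer only accepts str, the binary layer only accepts bytes, and str() on a bytes object yields its repr rather than decoded text.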
In a CGI script you need to write to sys.stdout, and you can use print() to do it. But a CGI script process under Apache has no locale environment settings; they are not part of the CGI specification. The sys.stdout encoding therefore defaults to ANSI_X3.4-1968, in other words ASCII. If you try to print() a string that contains non-ASCII characters to sys.stdout, you get "UnicodeEncodeError: 'ascii' codec can't encode character...: ordinal not in range(128)".

A simple solution is to pass the Apache process's LANG environment variable through to the CGI script using the mod_env PassEnv directive in the server or virtual host configuration: PassEnv LANG. On Debian/Ubuntu, make sure that the line ". /etc/default/locale" is uncommented in /etc/apache2/envvars, so that Apache runs with the system default locale and not the C (POSIX) locale, which is also ASCII. The following CGI script should then run without errors in Python 3.2:
```
#!/usr/bin/env python3
import sys
print('Content-Type: text/html; charset=utf-8')
print()
print('<html><body><pre>' + sys.stdout.encoding + '</pre>h€lló wörld</body></html>')
```