The standard Python idiom for setting the sys.stdout buffer to zero does not work with Unicode

Question

6

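When I write sysadmin scripts in Python, the buffering on sys.stdout that affects every print() call is annoying: I don't want to wait for the buffer to be flushed and then get one big chunk of output on the screen. Instead, I want each line of output as soon as the script generates it. I don't even want to wait for a newline to see the output.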

An idiom often used for this in Python is:

import os
import sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)

This has always worked well for me. Now I have noticed that it does not work with Unicode. See the following script:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import print_function, unicode_literals

import os
import sys

print('Original encoding: {}'.format(sys.stdout.encoding))
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
print('New encoding: {}'.format(sys.stdout.encoding))

text = b'Eisb\xe4r'
print(type(text))
print(text)

text = text.decode('latin-1')
print(type(text))
print(text)

This produces the following output:

Original encoding: UTF-8
New encoding: None
<type 'str'>
Eisb▒r
<type 'unicode'>
Traceback (most recent call last):
  File "./export_debug.py", line 18, in <module>
    print(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 4: ordinal not in range(128)

It took me hours to track down the cause (my original script is much longer than this minimal debug script). The culprit is this line:

sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)

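I had used this idiom for years, so I did not expect it to cause any problems. With that line commented out, the output is correct and looks like this: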

Original encoding: UTF-8
New encoding: UTF-8
<type 'str'>
Eisb▒r
<type 'unicode'>
Eisbär

So what is the script doing? To keep my Python 2.7 code as close to Python 3.x as possible, I always use

from __future__ import print_function, unicode_literals

which makes Python use the new print() function and, more importantly, makes all string literals Unicode by default. I have a lot of Latin-1 / ISO-8859-1 encoded data, for example:

text = b'Eisb\xe4r'

To work with it as intended, I first need to decode it to Unicode, which is what this line is for:

text = text.decode('latin-1')

Since the default encoding on my system is UTF-8, Python encodes the internal Unicode string to UTF-8 whenever I print it. But first the string has to be proper Unicode internally.

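All of this works fine in general, just not with a zero-byte output buffer so far. Any ideas? I noticed that sys.stdout.encoding is unset after the zero-buffering line, but I don't know how to set it again. It is a read-only attribute, and the OS environment variables LC_ALL and LC_CTYPE only seem to be evaluated when the Python interpreter starts.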

By the way: "Eisbär" is German for "polar bear".

- Marten Lehmann

@martineau Well, the suggestion sys.stdout = codecs.getwriter('utf8')(sys.stdout) doesn't work either. I really tried a lot of things and searched before asking, so I guess untested ideas don't help much. - Marten Lehmann

I've migrated the question for you. Next time, just flag it to get a moderator's attention and tell us what you need! :) - slhck

@MartenLehmann: The fact that it was untested is why I posted it as a comment rather than an answer. - martineau

1

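Have you considered alias python="python -u" instead of modifying sys.stdout? By the way, codecs.getwriter... only works if you (and every library you use) print nothing but Unicode text, so it is generally not recommended. - jfs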

2 Answers

0

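The b in the 'wb' argument of sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0) means the file is opened in binary mode, which is why Unicode does not work. Also, in Python 3 I cannot print a plain string to a stdout configured this way; it fails with TypeError: a bytes-like object is required, not 'str'.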
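As a rough illustration of the Python 3 behaviour described above, the sketch below assumes Python 3 and uses a pipe as a stand-in for the real stdout, so the terminal is left alone. The names raw, error and received are made up for this example.

```python
import os

# A pipe stands in for stdout so the experiment has no side effects.
read_fd, write_fd = os.pipe()

# Same call shape as the idiom: binary mode, buffer size 0.
raw = os.fdopen(write_fd, 'wb', 0)

try:
    raw.write('Eisb\xe4r')              # a str, not bytes
    error = None
except TypeError as exc:
    error = exc                         # binary streams reject str

# Encoding explicitly is what a text layer would otherwise do for us.
raw.write('Eisb\xe4r'.encode('utf-8'))
raw.close()

received = os.read(read_fd, 100)
os.close(read_fd)
```

The binary stream accepts only bytes, so any encoding step has to happen before the write; that is the job the replaced text-mode stdout used to do.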

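For the "sysadmin scripts" use case above, line buffering should be enough: the output is flushed whenever a newline is written, for example at the end of every ordinary print("mytext") call. For line buffering, just write: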

import os
import sys

sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 1)  # 1 : line buffered

I found this necessary to get line-by-line output when stdout is redirected to a pipe (simplest case: ./myprogram.py | cat) and possibly read by another program.
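To make the effect of the buffering argument 1 visible, here is a small sketch that assumes Python 3 on a POSIX system (select on pipe descriptors is not available on Windows). It writes to a pipe rather than the real stdout, and pipe_has_data is a hypothetical helper written for this example.

```python
import os
import select

def pipe_has_data(fd):
    """Return True if a read on fd would not block (POSIX pipes only)."""
    readable, _, _ = select.select([fd], [], [], 0)
    return bool(readable)

read_fd, write_fd = os.pipe()
out = os.fdopen(write_fd, 'w', 1)   # 1: line buffered, text mode

out.write('partial')                # no newline yet: stays in the buffer
before_newline = pipe_has_data(read_fd)

out.write(' line\n')                # the newline triggers the flush
after_newline = pipe_has_data(read_fd)
received = os.read(read_fd, 100)

out.close()
os.close(read_fd)
```

In this sketch nothing reaches the reader until the newline is written, which matches the line-by-line behaviour described for ./myprogram.py | cat.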

If you need a partial line flushed immediately, you can use (Python 3.3 or later):

print("mytext", end="", flush=True)

- tistolz

- Martijn Pieters · Accepted Answer

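The print function uses a special flag when writing to a file object, which causes the Python C API function PyFile_WriteObject to retrieve the output encoding for the unicode-to-bytes conversion. By replacing the stdout stream you lost that encoding. Unfortunately, you cannot explicitly set it again: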

encoding = sys.stdout.encoding
sys.stdout = os.fdopen(sys.stdout.fileno(), 'wb', 0)
sys.stdout.encoding = encoding  # Raises a TypeError; readonly attribute

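You cannot use the io.open function either, because it does not allow you to disable buffering if you want to use the encoding option. The proper way to make the print function flush immediately is the flush=True keyword: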

print(something, flush=True)
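The io.open restriction mentioned above can be checked directly. The sketch below assumes Python 3 and again uses a pipe in place of the real stdout. It also shows one possible alternative that is not part of this answer: wrapping an unbuffered binary stream in io.TextIOWrapper with write_through=True, which keeps an encoding while passing each write straight through.

```python
import io
import os

read_fd, write_fd = os.pipe()

# Text mode with buffering=0 is rejected by io.open.
# closefd=False keeps write_fd open when the failed call cleans up.
try:
    io.open(write_fd, 'w', buffering=0, encoding='utf-8', closefd=False)
    rejected = False
except ValueError:
    rejected = True

# Alternative sketch: unbuffered binary stream plus a write-through text layer.
raw = os.fdopen(write_fd, 'wb', 0)
out = io.TextIOWrapper(raw, encoding='utf-8', write_through=True)

out.write('Eisb\xe4r')              # no newline and no explicit flush()
received = os.read(read_fd, 100)    # the bytes are already in the pipe
encoding = out.encoding

out.close()
os.close(read_fd)
```

Whether replacing sys.stdout this way is appropriate depends on the script; per-call flush=True remains the less invasive option.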

If adding that everywhere is too tedious, consider using a custom print function:

def print(*args, **kw):
    flush = kw.pop('flush', True)  # Python 2.7 doesn't support the flush keyword
    __builtins__.print(*args, **kw)
    if flush:
        sys.stdout.flush()

Because the print() function in Python 2.7 does not actually support the flush keyword yet (most annoying), the custom version emulates it by adding an explicit flush.
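A slightly more defensive variant of the same idea is sketched below. It is an illustration, not the answer's code: the name flushing_print is hypothetical, the builtins module is imported explicitly instead of relying on __builtins__, and the flush goes to whatever stream was passed as file rather than always to sys.stdout.

```python
from __future__ import print_function  # needed on Python 2, harmless on 3

import sys

try:
    import builtins                     # Python 3
except ImportError:
    import __builtin__ as builtins      # Python 2

def flushing_print(*args, **kwargs):
    """Call print(), then flush the target stream unless flush=False."""
    flush = kwargs.pop('flush', True)   # removed so Python 2.7 accepts the call
    stream = kwargs.get('file')
    if stream is None:
        stream = sys.stdout             # same default that print() uses
    builtins.print(*args, **kwargs)
    if flush:
        stream.flush()
```

Usage is the same as for print(), for example flushing_print('working...', end='') to show a partial line immediately.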