Why do I need `sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin)`?

Question

6

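I'm writing a Python program that upper-cases all of its input (a replacement for the non-working tr '[:lowers:]' '[:upper:]'). The locale is ru_RU.UTF-8, and I use PYTHONIOENCODING=UTF-8 to set the STDIN/STDOUT encodings. This correctly sets sys.stdin.encoding. So why do I still need to create a decoding wrapper explicitly if sys.stdin already knows the encoding? If I don't create the wrapping reader, .upper() doesn't work correctly (it does nothing for non-ASCII characters).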

import sys, codecs
sys.stdin = codecs.getreader(sys.stdin.encoding)(sys.stdin) #Why do I need this?
for line in sys.stdin:
    sys.stdout.write(line.upper())

If stdin doesn't use the encoding, why does it have an .encoding attribute?

- Ark-kun

1

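@Ark-kun Because Python 2.x uses bytes to represent strings... so you need to convert them to Unicode (with decode) for upper to work on characters outside the ASCII range. You shouldn't have this problem in Python 3.x, because all strings are Unicode. - JBernardo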

1

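@JBernardo Data that isn't used by a class's methods shouldn't be in the class. It's like stdin having a .currentphaseofmoon or .numberoffilesondisk attribute. My point is that if the attribute does nothing, it is useless and confusing. By contrast, .NET's Stream class (byte-based) has no .Encoding property; only StreamReader does. - Ark-kun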

3

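The property itself doesn't do anything, but it's there because you may need it if you want to convert the data to text. Just because you don't always use something, or because it isn't automatic, doesn't mean it's useless. If the incoming data were binary and Python tried to convert it to Unicode automatically, you'd be very angry... - JBernardo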

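@JBernardo Yes, I understand that streams are byte-based and shouldn't perform any conversion. But I still don't see how sys.stdin.encoding differs from locale.getdefaultlocale()[1] or locale.getpreferredencoding(). P.S. Sorry for being frustrated that Python has no "one, and preferably only one, obvious way to do it" in these cases. - Ark-kun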

1

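Those are very different things. If the process is printing to a terminal, it will try to discover the encoding in use; you can configure the terminal to use any encoding (Python picks up that information, but .NET may not). The locale module uses system-wide information. - JBernardo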

1 Answer

- finiteint · Accepted Answer

To answer the "why", we need to understand Python 2.x's built-in file type, file.encoding, and how the two relate.

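The built-in file object deals in raw bytes: it always reads and writes raw bytes. The encoding attribute describes the encoding of the raw bytes in the stream. The attribute may or may not be present, and it may not even be reliable (for example, if we set PYTHONIOENCODING incorrectly for the standard streams). The only time a file object performs any automatic conversion is when a unicode object is written to the stream. In that case it uses file.encoding, if available, to perform the conversion.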

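When reading, the file object performs no conversion, because it returns raw bytes. In this case the encoding attribute is only a hint for the user to perform the conversion manually.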
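
As a rough illustration of that manual conversion, here is a minimal sketch written for Python 3, where bytes plays the role of Python 2's str. The sample text and the `encoding` variable are made up for the example; `encoding` merely stands in for whatever sys.stdin.encoding reports.

```python
# Bytes as a byte-oriented stream would hand them back (sample data, assumed UTF-8).
raw = "привет, world\n".encode("utf-8")
encoding = "utf-8"  # hypothetical stand-in for sys.stdin.encoding

# Upper-casing the raw bytes only touches ASCII letters.
upper_bytes = raw.upper()

# Decoding first gives real text, so the full Unicode case mapping applies.
upper_text = raw.decode(encoding).upper()

print(upper_bytes == "привет, WORLD\n".encode("utf-8"))  # Cyrillic bytes unchanged
print(upper_text == "ПРИВЕТ, WORLD\n")                   # everything upper-cased
```

This mirrors the symptom in the question: without the decode step, only the ASCII part of the line changes.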

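file.encoding is set in your case because you set the PYTHONIOENCODING variable, and the encoding attribute of sys.stdin was set accordingly. To get a text stream, we have to wrap it manually, as you did in your example code.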

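To think of it another way, imagine we had no separate text type (such as Python 2.x's unicode or Python 3's str). We could still work with text using raw bytes, but we would have to keep track of the encoding in use. That is how file.encoding is meant to be used: to track the encoding. The reader wrapper we create does the tracking and the conversion for us automatically.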
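
To make the reader wrapper concrete without touching the real stdin, the sketch below wraps an in-memory byte stream instead. The io.BytesIO object and its sample lines are assumptions for the demo; codecs.getreader is the same call used in the question, and the code is written for Python 3.

```python
import codecs
import io

# An in-memory byte stream standing in for a byte-oriented stdin (sample data).
raw_stream = io.BytesIO("строка один\nстрока два\n".encode("utf-8"))

# getreader() returns a StreamReader class; instantiating it wraps the byte stream.
reader = codecs.getreader("utf-8")(raw_stream)

# Iterating the reader yields decoded text lines, so upper() handles Cyrillic.
upper_lines = [line.upper() for line in reader]

print(len(upper_lines))
```

The wrapper carries the encoding and decodes on every read, which is exactly the bookkeeping the paragraph above describes.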

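Of course, wrapping sys.stdin automatically would be nicer (and that is what Python 3.x does), but changing the default behaviour of sys.stdin in Python 2.x would break backwards compatibility.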

Here is a comparison of sys.stdin in Python 2.x and 3.x:

# Python 2.7.4
>>> import sys
>>> type(sys.stdin)
<type 'file'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<type 'str'>           # In Python 2.x str is just raw bytes
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')

The io.TextIOWrapper class has been part of the standard library since Python 2.6. It has an encoding attribute, which it uses to convert raw bytes to and from Unicode.

# Python 3.3.1
>>> import sys
>>> type(sys.stdin)
<class '_io.TextIOWrapper'>
>>> sys.stdin.encoding
'UTF-8'
>>> w = sys.stdin.readline()
## ... type stuff - enter
>>> type(w)
<class 'str'>        # In Python 3.x str is Unicode
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
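
The same behaviour can be reproduced away from the terminal. This is only a sketch: an io.BytesIO with made-up sample bytes stands in for the underlying byte stream, and it is wrapped by hand in the same io.TextIOWrapper class that Python 3 uses for sys.stdin.

```python
import io

# A byte stream with sample UTF-8 data; it has no notion of text or encoding.
raw_stream = io.BytesIO("ёлка\n".encode("utf-8"))

# The wrapper owns the encoding and decodes on every read.
text_stream = io.TextIOWrapper(raw_stream, encoding="utf-8")
line = text_stream.readline()

print(type(line).__name__)              # the text type, not bytes
print(text_stream.encoding)             # the encoding lives on the wrapper
print(hasattr(raw_stream, "encoding"))  # the byte stream carries none
```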

The buffer attribute gives access to the raw byte stream backing stdin; it is usually a BufferedReader. Note that it has no encoding attribute.

# Python 3.3.1 again
>>> type(sys.stdin.buffer)
<class '_io.BufferedReader'>
>>> w = sys.stdin.buffer.readline()
## ... type stuff - enter
>>> type(w)
<class 'bytes'>      # bytes is (kind of) equivalent to Python 2 str
>>> sys.stdin.buffer.encoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_io.BufferedReader' object has no attribute 'encoding'

In Python 3, the presence or absence of the encoding attribute is consistent with the type of stream in use.
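
For completeness, here is how the program from the question might look on Python 3, where no explicit wrapper is needed. Treat it as a sketch: `upper_stream` is a hypothetical helper name, and the io.StringIO objects only stand in for sys.stdin and sys.stdout so the example can run without a terminal.

```python
import io

def upper_stream(src, dst):
    """Copy text lines from src to dst, upper-casing each one."""
    for line in src:
        dst.write(line.upper())

# In a real script on Python 3 the call would be:
#     upper_stream(sys.stdin, sys.stdout)
# Here in-memory text streams stand in for the standard streams.
demo_in = io.StringIO("привет\nmir\n")
demo_out = io.StringIO()
upper_stream(demo_in, demo_out)
result = demo_out.getvalue()

print(result == "ПРИВЕТ\nMIR\n")
```

Because sys.stdin is already a text stream in Python 3, the loop receives str objects and upper() works on non-ASCII input directly.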