What is the default encoding in Python 2.7.8?

Question

python python-2.7 encoding character-encoding

4

When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 chooses some default encoding. Which one is it, and where is this documented?

Some experiments show that the default encoding is not utf-8, ascii, sys.getdefaultencoding(), locale.getpreferredencoding(), or locale.getpreferredencoding(False).

Edit (to clarify my motivation): I want to know which encoding Python 2.7.8 chooses when I run a script like this.

import codecs

f = codecs.open('f.txt', 'r', encoding=None) # or equivalently: f=open('f.txt')
for line in f:
    print len(line) # obviously SOME encoding has been chosen if I can print the number of characters

I'm not interested in other ways of guessing the file's encoding.

- tba

1

Python's default encoding is ASCII, as documented here: https://docs.python.org/2/howto/unicode.html#encodings - n1c9

Then how do we explain this? http://i.imgur.com/Pw36l9B.png - tba

2 Answers

1

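Reading a file opened with codecs.open('f.txt','r',encoding=None) returns byte strings, not Unicode strings. It does not attempt to decode the file data with any encoding at all. This is equivalent to open('f.txt','r'). The length you get is the number of individual bytes stored in that line of the file, with no decoding applied.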

A small example:

>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
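
To make the "bytes, not characters" point concrete, here is a minimal sketch that is not part of the original answer. It uses an in-memory byte string instead of a file, and the sample text and variable names are assumptions chosen for illustration. It is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical sample: "30" followed by an acute accent (U+00B4),
# encoded as UTF-8. The accent takes two bytes in UTF-8.
text = u'30\xb4'
raw = text.encode('utf-8')

# len() on the undecoded byte string counts bytes ...
byte_count = len(raw)
# ... while len() on the decoded Unicode string counts characters.
char_count = len(raw.decode('utf-8'))

print(byte_count)
print(char_count)
```

The two numbers differ, which is why being able to print len(line) does not imply that any encoding was chosen.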

- Mark Tolonen

1

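Actually, I believe that, despite what the Python documentation says, codecs.open('f.txt','r',encoding=None) is equivalent to open('f.txt','r'), not open('f.txt','rb'). The 'b' is only added when an encoding is specified. See the library code posted in my answer. - Stephen Briney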

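@StephenBriney, you're right. I'll update. The evidence is also in the first codecs.open line: it returned only \n rather than \r\n, which indicates text mode rather than binary mode. I didn't notice that when I posted. - Mark Tolonen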

I think the documentation's wording is very misleading. It says "Files are always opened in binary mode, even if no binary mode was specified." - Stephen Briney

- Stephen Briney · Accepted Answer

It basically does no transparent encoding/decoding at all; it just opens the file and returns it.

Here is the code from the library:

def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):

    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.
        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.
        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.
        encoding specifies the encoding which is to be used for the
        file.
        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.
        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.
        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw

As you can see, if encoding is None, the opened file is returned as is.
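
The mode handling in that function can be isolated to see why encoding=None behaves like a plain open(). The helper name `effective_mode` below is hypothetical and not part of the codecs module; it only copies the mode logic from the library code quoted above.

```python
def effective_mode(mode='rb', encoding=None):
    """Return the mode string that the quoted codecs.open() code would
    pass to the builtin open(). Hypothetical helper for illustration."""
    if encoding is not None:
        if 'U' in mode:
            # Universal-newline flag is dropped when an encoding is given
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Binary mode is forced only on this branch
            mode = mode + 'b'
    return mode

print(effective_mode('r', None))     # mode is left untouched
print(effective_mode('r', 'ascii'))  # 'b' is appended
```

With encoding=None the mode is passed through unchanged, so 'r' stays text mode, which matches the \n versus \r\n observation in the comments.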

Here is your file with each byte shown as a decimal number next to its corresponding ASCII character:

46  .
46  .
46  .
32  'space'
48  0
45  -
49  1
10  'line feed'
10  'line feed'
91  [
69  E
118 v
101 e
110 n
116 t
32  'space'
34  "
72  H
97  a
114 r
118 v
97  a
114 r
100 d
32  'space'
67  C
117 u
112 p
32  'space'
51  3
48  0
180 'this is not ascii'
34  "
93  ]
10  'line feed'
46  .
46  .
46  .
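
A listing like the one above can be produced with a few lines of Python. This is a sketch, not the script used for the answer; the function name `dump_bytes`, the labels, and the sample bytes are assumptions. It runs on both Python 2.7 and Python 3.

```python
def dump_bytes(data):
    """Return (decimal value, label) pairs for each byte in a byte string."""
    rows = []
    for value in bytearray(data):  # bytearray yields ints on Python 2 and 3
        if value > 127:
            label = 'not ASCII'
        elif value == 10:
            label = 'line feed'
        elif value == 32:
            label = 'space'
        else:
            label = chr(value)
        rows.append((value, label))
    return rows

# Hypothetical sample: the end of the tag line, as raw bytes.
sample = b'30\xb4"]\n'
for value, label in dump_bytes(sample):
    print('%-3d %s' % (value, label))
```

To inspect a real file, the bytes could be read with open('f.txt', 'rb').read() and passed to the same function.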

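The problem you hit when opening the file as ASCII is the byte with decimal value 180. ASCII only goes up to 127, so I figured this must be some kind of extended ASCII in which 128-255 are used for additional symbols. The Wikipedia article on ASCII (https://en.wikipedia.org/wiki/ASCII) mentions a popular ASCII extension called windows-1252. In windows-1252, decimal value 180 maps to the acute accent (´). I then searched Google for the string in your file to see what it actually refers to, which is how I found "Harvard Cup 30´": http://www.365chess.com/tournaments/Harvard_Cup_30%C2%B4_1989/21650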
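
The reasoning above can be checked directly on the offending byte. This is a minimal sketch that assumes only the single byte 180 (0xB4) from the listing; `cp1252` is Python's codec name for windows-1252, and the variable names are illustrative.

```python
bad_byte = b'\xb4'  # decimal 180

# ASCII only defines 0-127, so strict decoding of this byte fails.
try:
    bad_byte.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# windows-1252 maps 0xB4 to U+00B4 ACUTE ACCENT.
decoded = bad_byte.decode('cp1252')

print(ascii_ok)
print(repr(decoded))
```

Note that ISO-8859-1 (latin-1) maps this byte to the same character, so the byte alone cannot distinguish the two encodings.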

So the correct encoding is probably windows-1252. Here is my test program:

import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()

Output:

... 0-1

[Event "Harvard Cup 30´"]
...