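Your data file starts with a UTF-8 BOM. This is what my Python 2 interactive session shows for the value being converted to a float (the pasted '0' literal carries the invisible BOM):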
>>> '0'
'\xef\xbb\xbf0'
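The \xef\xbb\xbf bytes are the UTF-8 encoding of U+FEFF ZERO WIDTH NO-BREAK SPACE, which is often used as a byte order mark, especially by Microsoft products. UTF-8 has no byte order issues, so the mark is not needed to record the byte ordering the way it is in UTF-16 or UTF-32; instead, Microsoft uses it as an aid to detect the encoding.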
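To make the byte-level claim concrete, here is a minimal Python 3 sketch showing that U+FEFF encodes to those three bytes in UTF-8, while in UTF-16 the same character encodes differently depending on byte order. The variable names are illustrative only.

```python
import codecs

bom_char = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# In UTF-8 there is only one possible encoding, so the mark carries
# no byte-order information.
utf8_bytes = bom_char.encode("utf-8")
matches_constant = utf8_bytes == codecs.BOM_UTF8

# In UTF-16 the two byte orders give different byte sequences,
# which is what makes the mark useful there.
utf16_le_bytes = bom_char.encode("utf-16-le")
utf16_be_bytes = bom_char.encode("utf-16-be")

print(utf8_bytes, utf16_le_bytes, utf16_be_bytes)
```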
In Python 3, you can open the file with the utf-8-sig codec; this codec looks for a BOM at the start of the file and removes it if present:
infile = open('text', 'r', encoding='utf-8-sig')
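As a rough illustration of the difference, the following Python 3 sketch writes a small sample file that starts with a BOM and reads its first field with both codecs. The file contents and the read_first_field helper are made up for this example, not taken from your data.

```python
import os
import tempfile


def read_first_field(path, encoding):
    """Return the first whitespace-separated field of the first line."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.readline().split()[0]


# Hypothetical sample data: a BOM followed by two numbers.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf0 1.5\n")
    sample_path = tmp.name

try:
    with_sig = read_first_field(sample_path, "utf-8-sig")  # BOM removed
    plain = read_first_field(sample_path, "utf-8")         # BOM kept as U+FEFF
finally:
    os.remove(sample_path)

value = float(with_sig)  # converts cleanly once the BOM is gone
```

With the plain utf-8 codec the first field still begins with U+FEFF, which split() does not treat as whitespace, so passing it to float() fails.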
In Python 2, you can use the codecs.BOM_UTF8 constant to detect and strip the BOM:
import codecs

for line in infile:
    # Strip the UTF-8 BOM bytes if the line starts with them.
    if line.startswith(codecs.BOM_UTF8):
        line = line[len(codecs.BOM_UTF8):]
    x, y = line.split()
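Since a BOM can only be meaningful at the very start of the file, a variant of the loop above could check just the first line. The sketch below assumes the file is read in binary mode and uses an in-memory io.BytesIO object as a stand-in for it; iter_lines_without_bom is a hypothetical helper name, and the sample data is invented.

```python
import codecs
import io


def iter_lines_without_bom(binary_file):
    """Yield raw lines, dropping a UTF-8 BOM from the first line only."""
    for index, line in enumerate(binary_file):
        if index == 0 and line.startswith(codecs.BOM_UTF8):
            line = line[len(codecs.BOM_UTF8):]
        yield line


# Stand-in for a file opened with open(path, 'rb').
sample = io.BytesIO(codecs.BOM_UTF8 + b"0 1.5\n2 3.5\n")

pairs = []
for raw_line in iter_lines_without_bom(sample):
    x, y = raw_line.decode("utf-8").split()
    pairs.append((float(x), float(y)))
```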
As the codecs documentation explains:
As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
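The encode and decode behaviour described in that passage can be checked with a short Python 3 round trip. This is only a sketch on an invented sample string, and the variable names are illustrative.

```python
text = "0 1.5"

# Encoding with utf-8-sig prepends the three signature bytes.
signed = text.encode("utf-8-sig")

# Decoding with utf-8-sig skips the signature; plain utf-8 keeps it
# as a leading U+FEFF character in the decoded string.
decoded_sig = signed.decode("utf-8-sig")
decoded_plain = signed.decode("utf-8")

# utf-8-sig also decodes data that has no signature at all.
decoded_unsigned = text.encode("utf-8").decode("utf-8-sig")

print(repr(signed), repr(decoded_sig), repr(decoded_plain))
```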