How can I find out a file's encoding in Python?

Question

How can I find out a file's encoding in Python?

29

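Does anybody know how to get the encoding of a file in Python? I know that the codecs module can open a file saved in a specific encoding, but you have to know that encoding in advance.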

import codecs
f = codecs.open("file.txt", "r", "utf-8")

Is there a way to automatically detect which encoding a file uses?

Thanks.

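Edit: Thanks, everyone, for the very interesting answers. You may also be interested in http://whatismyencoding.com/, which is based on chardet (and the site is powered by the bottle Python framework).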

- luc

5 Answers

8

You can use the BOM (http://en.wikipedia.org/wiki/Byte_order_mark) to detect the encoding, or try this library:

https://github.com/chardet/chardet
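
As a rough illustration of the BOM approach, here is a minimal sketch that uses only the standard library. The names `bom_encoding` and `file_bom_encoding` are hypothetical, not part of any library, and the sketch assumes the file actually starts with a BOM, which many files (most UTF-8 files in particular) do not.

```python
import codecs

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones:
# the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def bom_encoding(head):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return None


def file_bom_encoding(path):
    """Read the first four bytes of a file and look for a BOM."""
    with open(path, 'rb') as f:
        return bom_encoding(f.read(4))
```

The returned names ('utf-8-sig', 'utf-16', 'utf-32') are codecs that consume the BOM themselves when decoding, so the result can be passed straight to `open(path, encoding=...)`. A return value of `None` only means no BOM was found; it says nothing about the actual encoding.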

- ZelluX

5

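Here is a small snippet that helps you guess the encoding. It does a good job of choosing between Latin-1 and UTF-8. It converts a byte string to a Unicode string.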

# Attention: the order of encoding_guess_list is important. Example: "latin1" always succeeds.
encoding_guess_list = ['utf8', 'latin1']

def try_unicode(string, errors='strict'):
    """Decode a byte string with the first encoding in the list that works."""
    if isinstance(string, str):
        return string
    assert isinstance(string, bytes), repr(string)
    for enc in encoding_guess_list:
        try:
            return string.decode(enc, errors)
        except UnicodeError:
            continue
    raise UnicodeError('Failed to convert %r' % string)

def test_try_unicode():
    for start, should in [
        (b'\xfc', 'ü'),
        (b'\xc3\xbc', 'ü'),
        (b'\xbb', '\xbb'),  # postgres/psycopg2 latin1: RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ]:
        result = try_unicode(start, errors='strict')
        if result != should:
            raise Exception('Error: start=%r should=%r result=%r' % (
                start, should, result))

- guettli

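I simplified and adapted this to use only .decode() inside a try-except, exiting after either (1) a successful conversion or (2) exhausting encoding_guess_list. If everything fails, a different .decode() is applied with errors set to "replace" instead of "strict". - JDM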
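
The variant described in this comment might look roughly like the sketch below. It is a guess at the idea, not JDM's actual code: the function name `decode_with_fallback`, the default encoding list, and the choice of UTF-8 for the last-resort decode are all assumptions. cp1252 is used here instead of latin1 because latin1 never fails, which would make the fallback branch unreachable.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252'), last_resort='utf-8'):
    """Try each encoding strictly; if all fail, decode with errors='replace'."""
    for enc in encodings:
        try:
            return data.decode(enc)  # strict by default
        except UnicodeDecodeError:
            continue
    # Every strict attempt failed: undecodable bytes become U+FFFD.
    return data.decode(last_resort, errors='replace')
```

With these defaults, `decode_with_fallback(b'\xc3\xbc')` is decoded as UTF-8, `b'\xfc'` falls through to cp1252, and a byte such as `b'\x81'`, which neither codec accepts, ends up as a replacement character.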

4

There is Unicode Dammit, which comes from Beautiful Soup; it uses chardet and adds a few extra features.

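It first tries to read the encoding declared inside an XML or HTML file. Then it looks for a BOM or something similar at the start of the file. If neither works, it falls back to chardet.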

- Craig McQueen

1

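#!/usr/bin/python3

"""
Detect the encoding of the input line by line and convert it to UTF-8.
Suitable for looking at logs with mixed encodings (e.g. from mail systems).
"""

import sys
import chardet

# Read raw bytes; iterating stops at EOF (the original `while 1` loop never ended).
for line in sys.stdin.buffer:
    guess = chardet.detect(line)

    text = None
    try:
        # chardet reports encoding=None when it has no guess at all.
        if guess['encoding'] and guess['confidence'] > 0.3:
            text = line.decode(guess['encoding'])
    except (UnicodeError, LookupError):
        pass

    if text is not None:
        sys.stdout.buffer.write(text.encode('utf-8'))
    else:
        # Pass lines that could not be decoded through unchanged.
        sys.stdout.buffer.write(line)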

- Vladimir Grebenschikov

1

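The encoding cannot always be detected from a line in the middle of a file (a BOM, for example, appears only at the start). Also, the interesting advanced case where the encoding changes partway through a file is rare. - Noein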


- HS. · Accepted Answer

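Unfortunately, there is no "correct" way to determine the encoding of a file by looking at the file itself. This is a universal problem, not limited to Python or to any particular file system.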

If you are reading an XML file, the first line of the file might give you a hint about the encoding.
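
For the XML case, a minimal sketch of reading that hint could look like this. The function name `xml_declared_encoding` is hypothetical, and the sketch assumes the file is in an ASCII-compatible encoding (so not UTF-16 or UTF-32) and that the declaration, if present, sits at the very start of the data, optionally after a UTF-8 BOM.

```python
import codecs
import re

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(head):
    """Return the encoding named in an XML declaration, or None if absent."""
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _XML_DECL.match(head)
    return match.group(1).decode('ascii') if match else None
```

The declared name is only a claim made by whoever wrote the file, so it is worth handling a decode error even when a declaration is found.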

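Otherwise, you will need a heuristic approach such as chardet (one of the solutions given in the other answers), which tries to guess the encoding by examining the file's raw bytes. If you are on Windows, I believe the Windows API also exposes methods that try to guess the encoding from the data in the file.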
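
A minimal sketch of the chardet route, assuming the third-party chardet package is installed. The helper names `guess_encoding` and `guess_file_encoding`, the 0.5 confidence threshold, and the 64 KiB sample size are illustrative choices, not recommendations from chardet itself.

```python
def guess_encoding(raw, min_confidence=0.5):
    """Return chardet's best guess for some bytes, or None if it is unsure."""
    # Imported here so the module still loads where chardet is not installed.
    import chardet  # third-party: pip install chardet

    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    if result['encoding'] and result['confidence'] >= min_confidence:
        return result['encoding']
    return None


def guess_file_encoding(path, sample_size=64 * 1024):
    """Guess a file's encoding from its first sample_size bytes."""
    with open(path, 'rb') as f:
        return guess_encoding(f.read(sample_size))
```

The result is a statistical guess: short or mostly-ASCII samples tend to give low confidence, so it is safer to treat the returned name as a candidate and still handle decode errors when opening the file.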