BeautifulSoup decoding error

Question

I am trying to parse an HTML file generated by Evernote using Beautiful Soup. Here is the code:

html = open('D:/page.html', 'r')
soup = BeautifulSoup(html)

It gives the following error:

File "C:\Python33\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 24274: character maps to <undefined>

How can I fix this?

- bhavesh

1 Answer

Content provided by Stack Overflow.

- Martijn Pieters · Accepted Answer

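Opening the file in text mode makes Python decode it with the platform default codec (cp1252 here, per the traceback), and cp1252 has no character assigned to byte 0x9d. Instead, pass an encoded byte string, or even a file object opened in binary mode, to BeautifulSoup, and it will handle the decoding for you: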

from bs4 import BeautifulSoup

with open('D:/page.html', 'rb') as html:  # binary mode: yields bytes, not str
    soup = BeautifulSoup(html)
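To see why text mode fails while binary mode does not, here is a minimal standard-library sketch. The sample page and temporary file are invented for illustration, and cp1252 is passed explicitly to mimic the default seen in the traceback. A right double quotation mark encoded as UTF-8 ends in the byte 0x9d, which is one plausible way such a byte ends up in an HTML file.

```python
import os
import tempfile

# Hypothetical sample: a UTF-8 page containing curly quotes. The right
# double quotation mark (U+201D) is encoded as the bytes E2 80 9D.
page = '<p>\u201cquoted\u201d</p>'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'page.html')
    with open(path, 'wb') as f:
        f.write(page)

    # Text mode has to decode while reading; cp1252 leaves 0x9d undefined,
    # so the read raises UnicodeDecodeError.
    try:
        with open(path, 'r', encoding='cp1252') as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode returns the raw bytes without decoding anything.
    with open(path, 'rb') as f:
        raw = f.read()

print(type(text_mode_error).__name__)
print(raw == page)
```

The bytes read in binary mode are exactly what was written, which is what lets BeautifulSoup choose a codec later instead of Python choosing one at read time.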

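BeautifulSoup looks for HTML metadata in the document itself (such as a <meta> tag with a charset attribute) to decode the document. Failing that, it uses the chardet library to guess the encoding. chardet applies heuristics and statistics about byte sequences to give BeautifulSoup the most likely codec.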
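As a rough illustration of that detection, assuming bs4 is installed, the sketch below feeds BeautifulSoup UTF-8 bytes that declare their charset in a <meta> tag and then reads back the codec it settled on through the original_encoding attribute. The in-memory document is invented for the example, and the explicit 'html.parser' argument is an addition here to pin the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical document: UTF-8 bytes that declare their own charset.
markup = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>\u201cquoted\u201d</p></body></html>'
).encode('utf-8')

soup = BeautifulSoup(markup, 'html.parser')

detected = soup.original_encoding  # codec BeautifulSoup used to decode the bytes
text = soup.p.get_text()           # decoded text of the <p> element

print(detected)
print(text)
```

Checking original_encoding on your own file is a quick way to confirm whether the declared or guessed codec matches what you expect.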

If you have more context and already know the correct codec to use, pass it in with the from_encoding argument:

with open('D:/page.html', 'rb') as html:
    soup = BeautifulSoup(html, from_encoding=some_explicit_codec)  # placeholder: a codec name you already know
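For a concrete case, here is a small sketch assuming bs4 is installed. The Latin-1 sample bytes are made up for illustration, and 'html.parser' is named explicitly only to pin the parser. Since the bytes carry no charset declaration, from_encoding tells BeautifulSoup which codec to try first.

```python
from bs4 import BeautifulSoup

# Hypothetical input: 'café' encoded as Latin-1, with no charset declaration.
markup = '<p>caf\u00e9</p>'.encode('latin-1')

# from_encoding names the codec to use instead of leaving it to detection.
soup = BeautifulSoup(markup, 'html.parser', from_encoding='latin-1')

text = soup.p.get_text()
print(text)
```

If the codec you name cannot decode the bytes, BeautifulSoup falls back to its usual detection, so it is still worth checking the result.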

See the Encodings section of the documentation.