Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d

Question

50

I want to build a search engine, and I am following a tutorial from a website. I want to test parsing HTML.

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")

It fails with this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I have seen some solutions online that use the encode() function, but I don't know how to fit encode() into my code. Can anyone help me?

- Fakhriyanto

1

What is the full traceback of the exception? - Martijn Pieters

1 Answer


- Martijn Pieters · Accepted Answer

110

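In Python 3, files are opened as text (decoded to Unicode) for you, so you don't need to tell BeautifulSoup which codec to decode from. If decoding the data fails, that is because you didn't tell the open() call which codec to use when reading the file. Add the correct codec with an encoding argument: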

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

Otherwise the file is opened with your system's default codec, which depends on the operating system.
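To illustrate why the default codec matters, here is a minimal standard-library sketch. It assumes the Windows default in the question was cp1252 (the 'charmap' message suggests this, but the question does not confirm it) and uses a made-up sample string. The right double quotation mark U+201D is stored in UTF-8 as bytes ending in 0x9d, a byte that cp1252 leaves undefined.

```python
import locale

# Hypothetical sample: curly quotes inside HTML text, as a PDF converter might emit.
sample = "<pre>\u201cquoted\u201d</pre>"
raw = sample.encode("utf-8")  # the bytes as they would sit on disk

# What open() falls back to when no encoding argument is given (OS-dependent).
print("default text encoding:", locale.getpreferredencoding(False))


def try_decode(data, codec):
    """Return the decoded text, or a short failure message if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


# Assumed Windows default: byte 0x9d has no mapping in cp1252, so this fails.
cp1252_result = try_decode(raw, "cp1252")
# Decoding with the codec the file was actually written in succeeds.
utf8_result = try_decode(raw, "utf-8")

print(ascii(cp1252_result))
print(ascii(utf8_result))
```

Passing encoding='utf8' to open() performs the same decoding while the file is read, which is why the fix belongs on the open() call rather than on BeautifulSoup.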

- Martijn Pieters

2

You can also add errors='ignore' to open() in case the file is not in 'utf-8' format and you want to skip the non-UTF-8 bytes, which avoids the "UnicodeDecodeError: 'utf-8' codec can't decode byte" error. Source: https://dev59.com/1FMI5IYBdhLWcg3wV6Br - Altair7852

@Altair7852 That is a risky option, and it only really works if your input is otherwise in some codec that is a superset of ASCII. - Martijn Pieters

@Altair7852 The post you linked is about reading a PDF file, which is not a text file at all but a binary format. Opening it as text is the wrong approach. - Martijn Pieters

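Martijn Pieters, you are right, the linked post is not very relevant here apart from the flag, and yes, use it only if you know what you are doing. In my defense, I ran into the UTF-8 problem while reading HTML files, hence the comment. - Altair7852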
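As a rough illustration of the risk discussed in these comments, the sketch below decodes a made-up Latin-1 byte string as UTF-8 with different error handlers. The sample text is hypothetical; open() accepts the same errors values as bytes.decode().

```python
# Hypothetical input: text saved as Latin-1 (not UTF-8) containing "café".
raw = "café au lait".encode("latin-1")

# Strict decoding (the default) raises, which at least makes the problem visible.
strict_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")
# errors="replace" keeps a visible marker (U+FFFD) where data was lost.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, ascii(ignored), ascii(replaced))
```

The ASCII letters survive because Latin-1 and UTF-8 agree on them, but the accented character is lost without any warning, so errors='ignore' hides the encoding mismatch rather than fixing it.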

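I kept getting the same error because my config file contains a few Chinese characters. I added the 'utf-8' encoding to the config read call. Here is the code: config.read('../conf/PM_AutomaticTariffUpload_Converter.conf', encoding='utf8') - Mantu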
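A self-contained sketch of that config-file case, assuming the standard-library configparser module is what the comment refers to. The file name, section and key are invented for illustration.

```python
import configparser
import os
import tempfile

# Hypothetical config file with a non-ASCII value, written explicitly as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.conf")
    with open(path, "w", encoding="utf-8") as f:
        f.write("[upload]\nlabel = 资费表\n")

    config = configparser.ConfigParser()
    # Without encoding=..., read() uses the platform default codec, like open().
    files_read = config.read(path, encoding="utf-8")
    label = config["upload"]["label"]

# ascii() keeps the output printable on consoles that cannot show Chinese text.
print(len(files_read), ascii(label))
```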