'utf-8' codec can't decode byte 0x80

Question

'utf-8' codec can't decode byte 0x80

13

I'm trying to download the BVLC-trained model, but I'm getting this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it is caused by the following function (full code).

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any ideas on how to fix this?

- Ehab AlBadawy

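The error message is quite clear. Either your file is not UTF-8 at all, or it is corrupted. - Jongware

When I try to print f, I get the following: <_io.TextIOWrapper name='models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel' mode='r' encoding='utf8'> - Ehab AlBadawy

I tried changing the second line to with open(filename, 'r', encoding='utf8') as f: but I got the same error. - Ehab AlBadawy

Don't tell Python the file is UTF-8 unless you are sure it should be. And if Python tells you it is not valid UTF-8 but something else, open the file in a good code editor and look at what it contains. - Jongware

Glad you found both answers helpful! Note that you can only mark one answer as accepted; the choice is entirely yours. :-) - Martijn Pieters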


3 Answers

5

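You didn't open the file in binary mode, so f.read() tries to read it as a UTF-8-encoded text file, which doesn't seem to be working. But since we are hashing bytes, not strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it as a binary file and read it.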

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

But:

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
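The same contrast can be reproduced without the model file. The following is only a minimal sketch: the payload bytes and the helper name sha1_of_file are made up for illustration, and the text-mode open passes encoding="utf-8" explicitly so the failure does not depend on the platform default.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path):
    """Return the SHA1 hex digest of a file, reading it as raw bytes."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


# Hypothetical payload: 0x80 is never a valid first byte of a UTF-8 sequence.
payload = b"caffe\x80model"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    try:
        # Text mode has to decode the bytes first, and that is what fails.
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode never decodes, so hashing works for any content.
    digest = sha1_of_file(path)
finally:
    os.remove(path)

print(text_mode_failed, digest)
```

Binary mode hands hashlib the bytes exactly as they are stored on disk, which is what a checksum comparison needs.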

- DSM

3

I don't know why, since there is no hint in the documentation or the source code, but using the b character (binary, I guess) works fine (tf-version: 1.1.0):

image_data = tf.gfile.FastGFile(filename, 'rb').read()

For more information, see: gfile

- 4F2E4A2E


- Martijn Pieters · Accepted Answer

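You are opening a file that is not UTF-8 encoded, while the default encoding on your system is set to UTF-8.

Since you are computing a SHA1 hash, you should read the data as binary instead. The hashlib functions require you to pass in bytes: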

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

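Note the b added to the file mode.

See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. [...] In text mode, if encoding is not specified, the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes, use binary mode and leave encoding unspecified.)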
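To see which encoding a text-mode open() would fall back to on a given machine, the call named in the documentation can be run directly. This is just a quick check; the variable name default_encoding is arbitrary and the value differs between platforms and locale settings.

```python
import locale

# The encoding that text-mode open() uses when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints a UTF-8 variant, any binary file opened with mode 'r' is likely to hit the same UnicodeDecodeError as in the question.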

and from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.
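Because update() accepts bytes incrementally, a large model file does not have to be read into memory in one go. The sketch below is one possible variant, not the code from the original script: the function name sha1_matches and the 1 MiB default chunk size are assumptions chosen for illustration.

```python
import hashlib


def sha1_matches(filename, expected_sha1, chunk_size=1 << 20):
    """Compare a file's SHA1 hex digest with expected_sha1, reading in chunks."""
    digest = hashlib.sha1()
    with open(filename, "rb") as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha1
```

Feeding the hash in chunks produces the same digest as hashing the whole file at once, so the result is interchangeable with the one-line version above.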