How can I decode a UTF-8 byte array leniently?

Question

python python-3.x utf-8 error-handling decode

6

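I need to decode a UTF-8 sequence stored in a bytearray into a string.

The UTF-8 sequence may contain erroneous parts. In that case I need to decode as much as possible and (optionally?) replace the invalid parts with something like "?".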

# First part decodes to "ABÄC"
b = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])
s = str(b, "utf-8") 
print(s)

# Second part, invalid sequence, wanted to decode to something like "AB?C"
b = bytearray([0x41, 0x42, 0xC3, 0x43])
s = str(b, "utf-8")
print(s)

What is the best way to do this in Python 3?

- Joe

1 Answer

- Zero Piraeus · Accepted Answer

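Python has several built-in error-handling schemes for encoding and decoding between str and bytes or bytearray. You select one with the errors argument, for example when calling bytearray.decode():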

>>> b = bytearray([0x41, 0x42, 0xC3, 0x43])

>>> b.decode('utf8', errors='ignore')  # discard malformed bytes
'ABC'

>>> b.decode('utf8', errors='replace')  # replace with U+FFFD
'AB�C'

>>> b.decode('utf8', errors='backslashreplace')  # replace with backslash-escape
'AB\\xc3C'
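As a rough side-by-side sketch of the schemes above, the following script also includes the default 'strict' scheme, which raises UnicodeDecodeError instead of substituting anything. The helper name decode_with and the results dictionary are made up for this illustration and are not part of any library.

```python
# Sample data from the question: 0xC3 starts a two-byte sequence,
# but 0x43 ("C") is not a valid continuation byte.
data = bytearray([0x41, 0x42, 0xC3, 0x43])


def decode_with(handler):
    """Decode `data` with the given error scheme (hypothetical helper)."""
    try:
        return data.decode("utf-8", errors=handler)
    except UnicodeDecodeError as exc:
        # 'strict' is the default scheme and refuses malformed input.
        return f"raised UnicodeDecodeError at byte {exc.start}"


results = {
    name: decode_with(name)
    for name in ("strict", "ignore", "replace", "backslashreplace")
}

for name, text in results.items():
    print(f"{name:17} {text!r}")
```

Using 'backslashreplace' for decoding assumes Python 3.5 or later.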

In addition, you can write your own error handler and register it:

import codecs

def my_handler(exception):
    """Replace unexpected bytes with '?'."""
    return '?', exception.end

codecs.register_error('my_handler', my_handler)

>>> b.decode('utf8', errors='my_handler')
'AB?C'
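A slightly more defensive variant of the handler above might look like the following sketch. It differs from the original in two ways that are assumptions on my part, not requirements: it emits one '?' per byte in the reported error range rather than a single '?', and it re-raises anything that is not a decoding error, so it cannot be used for encoding by accident. The names question_mark_handler and 'question_mark' are illustrative only.

```python
import codecs


def question_mark_handler(exc):
    """Replace each undecodable byte with '?'; re-raise anything else."""
    if not isinstance(exc, UnicodeDecodeError):
        # Only decoding errors are handled here (a design choice).
        raise exc
    # exc.start and exc.end delimit the offending bytes in the input.
    bad_bytes = exc.end - exc.start
    # Return the replacement text and the position at which to resume.
    return "?" * bad_bytes, exc.end


# 'question_mark' is an arbitrary name chosen for this sketch.
codecs.register_error("question_mark", question_mark_handler)

valid = bytearray([0x41, 0x42, 0xC3, 0x84, 0x43])
invalid = bytearray([0x41, 0x42, 0xC3, 0x43])

decoded_valid = valid.decode("utf-8", errors="question_mark")
decoded_invalid = invalid.decode("utf-8", errors="question_mark")

print(decoded_valid)
print(decoded_invalid)
```

Well-formed input never reaches the handler, so valid sequences decode exactly as they would under the default scheme.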

All of these error-handling schemes can also be used with the str() constructor, as in your question:

>>> str(b, 'utf8', errors='my_handler')
'AB?C'

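This works, although calling .decode() explicitly on the bytes or bytearray object is more idiomatic. (In Python 3, str has no decode() method.)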