Unbaking mojibake

Question

python unicode character-encoding decoding mojibake

When you encounter incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

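I know this image filename should have been some Japanese characters. But despite various attempts at quoting/unquoting with urllib and encoding and decoding with iso8859-1 and utf8, I haven't been able to undo the damage and recover the original filename.

Is the corruption reversible?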

- wim

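I'm impressed that you could tell it was Japanese from that gibberish. - Burhan Khalid

That didn't come from the gibberish itself. I knew it from the context in which I received the gibberish. - wim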

1 Answer

- galinden · Accepted Answer

You could use chardet (install it with pip):

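import chardet

# Python 2: a plain string literal is a byte string, so chardet can inspect it directly.
your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    # Decode the bytes using whatever encoding chardet guessed.
    correct_str = your_str.decode(detected_encoding)
except UnicodeDecodeError:
    print("Could not estimate encoding")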

Result: 時間試験観点（アニメパス）_10秒 (not sure whether this is correct)
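One way to gain confidence in a candidate is to run the corruption forwards and see whether it reproduces the garbled name. The sketch below is for Python 3 and assumes the original bytes were Shift-JIS and were wrongly displayed as cp850; both codec names are guesses about how the damage happened, not something the filename itself proves.

```python
garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
candidate = "時間試験観点（アニメパス）_10秒"

# Re-create the suspected corruption: Shift-JIS bytes read as if they were cp850.
rebaked = candidate.encode("shift-jis").decode("cp850")

# If the guess about both codecs is right, the two strings match exactly.
print(rebaked == garbled)
```

A match means the candidate is consistent with the garbled text under that codec pair. It does not rule out other pairs that would explain it equally well.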

For Python 3 (source file encoded as utf8):

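import chardet
import codecs

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

# Undo the wrong decode: turn the mojibake back into the original bytes.
try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    # Let chardet guess the real encoding of the recovered bytes.
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (TypeError, UnicodeDecodeError):
        # TypeError covers the case where chardet returns None as the encoding.
        print("could not decode encoded_str as %s" % detected_encoding)
        correct_str = None

    # Only write the file if decoding actually succeeded.
    if correct_str is not None:
        with codecs.open("output.txt", "w", "utf-8-sig") as out:
            out.write(correct_str)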

In short, here is the gist:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点（アニメパス）_10秒.png'
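When neither codec is known in advance, the same encode/decode round trip can be tried over a list of plausible pairs, keeping only those that complete without an error. This is a rough sketch using only the standard library; the two codec lists and the helper name `unbake_candidates` are illustrative assumptions and should be adapted to where the data likely came from.

```python
import itertools

# Hypothetical candidate lists; extend them to suit the data's likely origin.
WRONG_DECODERS = ["cp850", "cp437", "latin-1", "cp1252"]
TRUE_ENCODINGS = ["shift-jis", "euc-jp", "utf-8", "gb2312"]

def unbake_candidates(garbled, wrong_codecs=WRONG_DECODERS, true_codecs=TRUE_ENCODINGS):
    """Yield (wrong_codec, true_codec, text) for each pair that round-trips without error."""
    for wrong, true in itertools.product(wrong_codecs, true_codecs):
        try:
            # Re-encode with the suspected wrong codec, then decode with the suspected real one.
            yield wrong, true, garbled.encode(wrong).decode(true)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled text

s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
for wrong, true, text in unbake_candidates(s):
    print(wrong, true, text)
```

A pair that survives both steps is only a candidate. Picking the right one still relies on context, such as knowing the name should be Japanese.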