Python encoding/decoding problems

Question

python python-2.7 encoding ascii non-ascii-characters

7

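How do I decode a string such as "weren\xe2\x80\x99t" back to a normal encoding?

That is, the word should come out as "weren't", not "weren\xe2\x80\x99t". For example: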

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.decode('utf-8').encode('ascii', 'ignore')

â€œThings
“Things
Things

But what I actually want to get is "Things.

Or:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.decode('utf-8').encode('ascii', 'ignore')

werenâ€™t
weren’t
werent

But what I actually want to get is "weren't".

How can I do this?

- Brana

1

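You need to supply the translation dictionary you want, e.g. from fancy quotes to plain ASCII quotes, and apply it with the Unicode string's translate method. I don't think there is a standard "convert to ASCII characters" translation dictionary. - Alex Martelli

I just made one :) - Brana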
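
For context on the first line of each output in the question: the garbled text is what UTF-8 bytes look like when something decodes them one byte at a time. The Python 3 sketch below reproduces it under the assumption that the console used Windows-1252; the question does not say which code page was active, so that codec is a guess that happens to match the output shown. The variable names are illustrative.

```python
# UTF-8 bytes for U+201C (left double quotation mark) followed by "Things".
raw = b"\xe2\x80\x9cThings"

# Decoding with the wrong codec (assumed here: Windows-1252) turns each of
# the three quote bytes into its own character, giving the garbled form.
misread = raw.decode("cp1252")

# Decoding with the codec the bytes were written in gives one curly quote.
correct = raw.decode("utf-8")

# The same applies to the second example in the question.
misread_werent = b"weren\xe2\x80\x99t".decode("cp1252")
```

So the bytes themselves are fine; only the first print shows them through the wrong codec. Getting from the curly quote to a plain ASCII quote is a separate step, which is what the answers address.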

3 Answers

4

In Python 3 I would do it this way:

string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)

No translation map needed, just code :)

- Wim Feijen

I was looking for exactly this answer! - AKMalkadi

Is there a solution that works on Python 2.7.5? - undefined

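Hi @SudiptaRoy, is there any chance you can upgrade to Python 3.x? If so, I strongly recommend it. I don't have Python 2.7.5 at hand, but I am fairly confident the following will work. No guarantees, but good luck! string = u"\xe2\x80\x9cThings"; bytes_string = string.encode("raw_unicode_escape"); happy_result = bytes_string.decode("utf-8"); print(happy_result) - undefined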
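
A note on why the Python 3 answer above works: the raw_unicode_escape codec writes every code point below 256 as the single byte with the same value, so it recovers the original UTF-8 bytes from the mis-typed str. The sketch below does the same round trip with the latin-1 codec, which may make that byte-for-code-point mapping easier to see. The function name fix_mojibake is made up for this example, and the approach only applies when every character in the input is below U+0100.

```python
def fix_mojibake(text):
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    # latin-1 maps code points 0-255 to identical byte values, so this
    # recovers the original bytes. Any character above U+00FF raises
    # UnicodeEncodeError instead of being silently mangled.
    raw = text.encode("latin-1")
    # Decode the recovered bytes with the codec they were written in.
    return raw.decode("utf-8")


fixed = fix_mojibake("weren\xe2\x80\x99t")
```

The result still contains the curly apostrophe U+2019, now as a single character. Turning it into a plain ASCII apostrophe needs a translation step such as the one in the other answers.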

1

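You should provide a translation map that maps Unicode characters to other Unicode characters (the latter should be in the ASCII range if you want to re-encode to ASCII afterwards).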

uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
# translate() returns a new string, so the result has to be assigned
yourstring = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(yourstring)  # prints: weren't

- Oliver W.

I know I can do that. But is there a ready-made map that does this automatically? - Brana
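
On the ready-made map: as far as I know the standard library has no complete table, but unicodedata.normalize covers part of the job. The Python 3 sketch below combines it with a small hand-written punctuation table. NFKD normalization splits accented letters into a base letter plus a combining mark and replaces compatibility characters such as superscripts, while curly quotes and dashes have no decomposition and still need the table. The names PUNCTUATION and to_ascii are invented for this example, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical, partial table for characters that NFKD leaves untouched.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u2010": "-", "\u2011": "-", "\u2012": "-", "\u2013": "-", "\u2014": "-",
})


def to_ascii(text):
    """Return an ASCII-only approximation of a str."""
    # 1. Replace quotes and dashes from the hand-written table.
    text = text.translate(PUNCTUATION)
    # 2. Decompose accents and compatibility characters.
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop whatever is still non-ASCII, including the combining marks
    #    produced in step 2. Note that this loses information silently.
    return text.encode("ascii", "ignore").decode("ascii")
```

With this sketch, to_ascii("weren\u2019t") gives weren't and to_ascii("caf\u00e9") gives cafe. Characters with no mapping and no decomposition simply disappear, which may or may not be acceptable.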


- Brana · Accepted Answer

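I mapped the most common odd characters, so this is basically a complete answer built on Oliver W.'s answer.

This function is not ideal, but it is a good place to start. More character definitions can be found here: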

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

...

def unicodetoascii(text):
    # Maps Unicode code points to the code point of an ASCII stand-in.
    # Duplicate keys from the first draft have been removed.
    uni2ascii = {
            # quotation marks, accented e, dashes
            ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
            ord('\xc3\xa9'.decode('utf-8')): ord('e'),
            ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

            # hyphens
            ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),

            # primes (U+2032 to U+2037), all reduced to one apostrophe
            ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

            # superscript operators and parentheses (U+207A to U+207E)
            ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
            ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
            ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
            ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
            ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
    }
    # text is a UTF-8 byte string. encode('ascii') raises UnicodeEncodeError
    # for any non-ASCII character that is missing from the table.
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")
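
For anyone on Python 3, here is a rough sketch of how the same idea might look there. It is not a port of the full table above: UNI2ASCII is a shortened, hypothetical table and unicodetoascii_py3 is a made-up name. Two things differ from the Python 2 function. Values in the table can be whole strings, so a character like the triple prime can become three apostrophes, and the final encode uses "replace" so unmapped characters become "?" instead of raising.

```python
# Hypothetical, partial table: keys are code points, values may be strings
# of any length.
UNI2ASCII = {
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2014: "-",    # em dash
    0x2026: "...",  # horizontal ellipsis
    0x2033: "''",   # double prime
    0x2034: "'''",  # triple prime
}


def unicodetoascii_py3(data):
    """Decode UTF-8 bytes and return an ASCII-only str."""
    text = data.decode("utf-8").translate(UNI2ASCII)
    # "replace" substitutes "?" for anything the table missed, where a
    # strict encode would raise UnicodeEncodeError.
    return text.encode("ascii", "replace").decode("ascii")
```

Calling unicodetoascii_py3(b"weren\xe2\x80\x99t") returns weren't. Whether "?" or an exception is the better behaviour for unmapped characters depends on how much silent data loss you can tolerate.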