Python: Solving Unicode Hell with Unidecode

Question

12

I have been looking into ways of converting text to ASCII, so that ā becomes a, ñ becomes n, and so on.

unidecode has been very helpful for this.

# -*- coding: utf-8 -*-
from unidecode import unidecode
print(unidecode(u"ā, ī, ū, ś, ñ"))
print(unidecode(u"Estado de São Paulo"))

Produces:

a, i, u, s, n
Estado de Sao Paulo

However, I cannot reproduce this result with data read from an input file.

Contents of test.txt:

ā, ī, ū, ś, ñ
Estado de São Paulo

# -*- coding: utf-8 -*-
from unidecode import unidecode
with open("test.txt", 'r') as inf:
    for line in inf:
        print unidecode(line.strip())

Yields:

A, A<<, A<<, A, A+-
Estado de SAPSo Paulo

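And:

RuntimeWarning: Argument is not an unicode object. Passing an encoded string will likely have unexpected results.

Question: How can I read these lines in as Unicode so that I can pass them to unidecode?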

- e h

3

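Why is this "Unicode hell"? Those accented characters are perfectly fine. If they were mangled beyond repair, that would be the real "hell" (and one could argue that your solution does exactly that). - tripleee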

4

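I agree. These are first-class characters, and I feel quite guilty about crushing them, but that is what I have done. The good news is that I will have time to think it over in ASCII purgatory. - e h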

2 Answers

5

import codecs
with codecs.open('test.txt', encoding='whicheveronethefilewasencodedwith') as f:
    ...

The codecs module provides functions that handle Unicode encoding and decoding automatically, including when opening files.
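To make the decoding step concrete, here is a minimal sketch in Python 3 syntax of what a decoding reader does with the bytes on disk. It assumes the file is UTF-8; `SAMPLE` and `decode_text` are illustrative names, not part of the codecs module. The Latin-1 line is only a guess at how undecoded bytes end up looking like the garbled output in the question.

```python
import codecs

# Sample text mirroring one line of test.txt (assumed to be stored as UTF-8).
SAMPLE = "ā, ī, ū, ś, ñ"


def decode_text(raw, encoding="utf-8"):
    """Turn raw bytes into a text string, the step a decoding reader performs."""
    return codecs.decode(raw, encoding)


# What a binary read of the file would hand back: undecoded bytes.
raw = SAMPLE.encode("utf-8")

# Decoding with the right encoding recovers the original characters.
text = decode_text(raw)

# Treating every byte as its own character (Latin-1) splits each accented
# letter into two unrelated characters. This is an assumption about the
# cause of the garbling, not something the thread confirms.
misread = decode_text(raw, "latin-1")

print(len(raw), len(text), len(misread))
print(text == SAMPLE, misread == SAMPLE)
```

Each accented letter in the sample takes two bytes in UTF-8, so the byte string is longer than the decoded text, while the Latin-1 reading keeps one character per byte.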

- user2357112

Thanks. Both answers are perfect; I accepted Mark's because he answered sooner. - e h

- Mark Ransom · Accepted Answer

Use the codecs.open function.

with codecs.open("test.txt", 'r', 'utf-8') as inf:

Edit: The above applies to Python 2.x. In Python 3 you don't need codecs, because the encoding parameter has been added to the regular open.

with open("test.txt", 'r', encoding='utf-8') as inf:
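Putting the Python 3 form together with the loop from the question might look like the sketch below. The helper name `transliterate_lines` and its `transliterate` parameter are hypothetical, introduced only so the example does not depend on any particular library being installed. The file is assumed to be UTF-8.

```python
def transliterate_lines(path, transliterate, encoding="utf-8"):
    """Read `path` as text and apply `transliterate` to each stripped line.

    `transliterate` is any callable taking and returning a str,
    for example unidecode.unidecode.
    """
    results = []
    # Passing `encoding` makes open() decode the bytes into str while reading,
    # so each line is already Unicode text when it reaches `transliterate`.
    with open(path, "r", encoding=encoding) as inf:
        for line in inf:
            results.append(transliterate(line.strip()))
    return results
```

With unidecode installed, `transliterate_lines("test.txt", unidecode)` after `from unidecode import unidecode` should give the same ASCII lines as the string-literal example in the question, provided test.txt really is UTF-8.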