Remove non-ASCII characters from any given string type in Python

Question

python string unicode replace non-ascii-characters

>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

Since I am removing the non-ASCII characters, this is the form I want the string to be stored in internally. Why did decode("ascii") give out a Unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

Which is what I want. I don't understand this behavior. Can someone explain to me what is happening here?

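Edit: I thought this would help me understand things, so that I could solve the real program problem that I state here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)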

- fullmooninu

2 Answers

It's simple: .encode converts Unicode objects into (byte) strings, and .decode converts (byte) strings into Unicode.
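
As a rough illustration of that direction rule, here is a minimal sketch. It assumes Python 3, where the Unicode type is called `str` and the encoded type is called `bytes` (in the Python 2 session from the question these are `unicode` and `str` respectively). The variable names are only for illustration.

```python
# Python 3 sketch: `str` holds Unicode text, `bytes` holds encoded data.
text = "a\u00f5"                     # 'a' followed by LATIN SMALL LETTER O WITH TILDE

encoded = text.encode("utf-8")       # encode: Unicode text -> bytes
decoded = encoded.decode("utf-8")    # decode: bytes -> Unicode text

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded == text)
```

Python 3 enforces the rule more strictly than Python 2 does: `str` has no `.decode` method and `bytes` has no `.encode` method, so the mix-up from the question cannot be written at all.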

- Ned Batchelder

If this doesn't work, you can also try BeautifulSoup(html).encode for HTML, or the regex module. - Andrew Scott Evans

- Daniel Roseman · Accepted Answer

Why did the decode("ascii") give out a Unicode string?

Because that's what decode is for: it decodes byte strings, like your ASCII one, into Unicode.

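In your second example, you're trying to "decode" a string which is already Unicode, which has no effect. But to print it to your terminal, Python must encode it in your default encoding, which is ASCII. Because you haven't done that step explicitly, and therefore haven't specified the 'ignore' parameter, it raises the error that it can't encode the non-ASCII characters.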
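
The implicit ASCII encoding step can be made visible with a small sketch. In Python 2, calling .decode on a unicode object in effect first encodes it with the default codec (ASCII) in strict mode, which is why the traceback in the question reports a UnicodeEncodeError from inside a decode call. The code below assumes Python 3 and only imitates that behavior; `py2_style_decode` is a hypothetical helper written for this illustration, not a real library function.

```python
def py2_style_decode(text, encoding, errors="strict"):
    """Hypothetical helper imitating Python 2's unicode.decode()."""
    # Hidden step: the text is first encoded with the default codec (ASCII)
    # in strict mode. The caller's `errors` argument is NOT used here.
    raw = text.encode("ascii")
    # Only this second step receives the caller's arguments.
    return raw.decode(encoding, errors)


text = "a\u00f5"

try:
    py2_style_decode(text, "ascii", "ignore")
    outcome = "no error"
except UnicodeEncodeError:
    # "ignore" never reached the hidden encode step, so it failed.
    outcome = "UnicodeEncodeError"

# Doing the encode step explicitly lets "ignore" take effect.
explicit = text.encode("ascii", "ignore")

print(outcome)
print(explicit)
```

Doing the encode explicitly, as in the last call of the question's session, is what allows the 'ignore' error handler to apply.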

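The trick to all of this is remembering that decode takes an encoded byte string and converts it to Unicode, and encode does the reverse. It might be easier if you understand that Unicode is not an encoding.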
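
Putting the two directions together, one possible way to strip non-ASCII characters from either kind of string is sketched below. It assumes Python 3 (`bytes` for encoded data, `str` for Unicode text), and `strip_non_ascii` is a hypothetical helper name, not something from the question or from the standard library.

```python
def strip_non_ascii(value):
    """Return `value` without non-ASCII content, keeping the input's type.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        # Encoded input: decode, dropping bytes outside the ASCII range,
        # then encode back so the caller gets bytes again.
        return value.decode("ascii", "ignore").encode("ascii")
    # Unicode input: encode, dropping characters outside the ASCII range,
    # then decode back so the caller gets text again.
    return value.encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii(b"a\xf5"))
print(strip_non_ascii("a\u00f5"))
```

The helper returns the same type it was given. If a single output type is wanted regardless of input, as in the question, the final conversion on one of the two branches can be dropped.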