Convert Unicode to ASCII without errors in Python

Question

python unicode utf-8 character-encoding ascii

202

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

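I assume this means the HTML contains a malformed attempt at Unicode somewhere. Can I just drop the bytes that cause the problem instead of getting an error?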

- themirror

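It looks like you hit a "no-break space" in the web page? In UTF-8 it needs to be preceded by a c2 byte, otherwise you may get a decoding error: http://hexutf8.com/?q=C2A0 - jar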
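To illustrate the comment above, here is a minimal sketch of how the byte 0xa0 behaves under different codecs. It is written for Python 3 (the question itself uses Python 2), and `try_decode` is a hypothetical helper introduced only for this example.

```python
# Python 3 sketch; the byte strings below are illustrative data, not the asker's page.
nbsp_utf8 = b"\xc2\xa0"  # U+00A0 (no-break space) encoded as UTF-8
lone_byte = b"\xa0"      # the same character as a single Latin-1 / windows-1252 byte


def try_decode(raw, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + str(exc)


# The two-byte sequence is valid UTF-8 and decodes to the no-break space.
print(nbsp_utf8.decode("utf-8") == "\u00a0")

# A lone 0xa0 is neither valid UTF-8 nor valid ASCII, so both attempts fail.
print(try_decode(lone_byte, "utf-8"))
print(try_decode(lone_byte, "ascii"))

# Under Latin-1 the single byte decodes cleanly to the same character.
print(try_decode(lone_byte, "latin-1") == "\u00a0")
```

The ASCII failure is the same kind of error shown in the traceback: in Python 2, calling `.encode()` on a byte string first decodes it implicitly with the ASCII codec.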

1

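The title of this question should be revised to indicate that it is specifically about parsing the result of an HTML request, not about "Convert Unicode to ASCII without errors in Python". - MRule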

A reminder to anyone working with text like \x1b[38;5;226m...: that is an ANSI escape code, not Unicode. - SurpriseDog

12 Answers


- HimalayanCoder · Answer 1

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

Works for me.
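The same decode-then-encode idea can be sketched in Python 3, where bytes and text are separate types. The function name `to_ascii`, the default `windows-1252` source encoding, and the sample bytes are assumptions made for illustration; the page's real encoding should come from its headers or meta tag.

```python
def to_ascii(raw, source_encoding="windows-1252"):
    """Decode raw bytes, then drop every character that has no ASCII form."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode(source_encoding, errors="replace")
    # errors="ignore" silently discards anything outside the ASCII range,
    # including the U+FFFD placeholders produced above.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample: "café" followed by a no-break space in windows-1252.
sample = b"caf\xe9\xa0au lait"
print(to_ascii(sample))
```

Note that this discards characters rather than transliterating them, so accented letters vanish instead of becoming their unaccented counterparts.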

- Haroon Rashedu · Answer 2

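It looks like you are using Python 2.x. Python 2.x uses ASCII as its default encoding and does not handle non-ASCII bytes implicitly, which is why the exception is raised.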

Paste the following line right after the shebang and the problem should go away.

# -*- coding: utf-8 -*-