Convert Unicode to ASCII without errors in Python

Question

python unicode utf-8 character-encoding ascii

202

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

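I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?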

- themirror

It looks like you hit a "no-break space" in the web page. In UTF-8 it must be preceded by a c2 byte, otherwise you can get a decoding error: http://hexutf8.com/?q=C2A0 - jar

1

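The title of this question should be edited to show that it is specifically about parsing the result of an HTML request, not about "converting Unicode to ASCII without errors in Python". - MRule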

A heads-up for anyone working with text like \x1b[38;5;226m...: these are ANSI escape codes, not Unicode. - SurpriseDog

12 Answers

139

As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Sometimes it is desirable to remove the accents from characters and print the base form. This can be done with:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

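You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.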

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. See this question for more details.
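The two steps above (replacing punctuation, then stripping accents) can be combined. The following is a minimal Python 3 sketch, not the answer author's code: the `PUNCTUATION` table and the `to_ascii` name are hypothetical, and the table covers only a handful of common characters.

```python
import unicodedata

# Hypothetical lookup table: code point -> ASCII replacement.
# Extend it with whatever punctuation shows up in your own data.
PUNCTUATION = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",    # EN DASH
    0x2014: "-",    # EM DASH
}


def to_ascii(text):
    """Return an ASCII-only approximation of text (lossy)."""
    # One pass over the string replaces all mapped punctuation.
    text = text.translate(PUNCTUATION)
    # NFKD splits accented letters into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # 'ignore' then drops the combining marks and anything else non-ASCII.
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a caf\u00e9 \u2013 \u201cnice\u201d"))
```

A single `str.translate` call avoids chaining one `.replace()` per character, which is the main reason to prefer a table here.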

- Peter Gibson

5

This is a model answer: it addresses the question as asked and is also practical for the problem that probably lies behind it. - shanusmagnus

112

Use unidecode - it converts even weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.

$ pip install unidecode

Then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

- Nimo

7

Hallelujah - I finally found an answer that works for me. - Aurielle Perlmann

15

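Upvoted because it's fun. Note that this mangles words in every language that uses accents. Škoda is not Skoda. Skoda probably means something nasty involving eels and a hovercraft. - Sylvain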

1

I had been scouring the web for days, until now... thank you, thank you very much. - Stephen

110

2018 Update:

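As of February 2018, using compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you do a simple decode on a gzipped response, you get an error like this: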

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

To decode a gzipped response in Python 3, you need to import the following modules:

import gzip
import io
from urllib.request import urlopen  # needed for the urlopen() call below

Note: in Python 2 you would use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

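This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the GzipFile class. After that, the decompressed file can be read back into bytes and finally decoded into readable text.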
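Not every response is compressed, and calling GzipFile on a plain body fails with "Not a gzipped file". A hedged Python 3 sketch of one way to guard against that: check for the two gzip magic bytes (0x1f 0x8b, the same 0x8b that appears at position 1 in the error above) before decompressing. The `decode_body` name and its `charset` default are assumptions for illustration only.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, charset="utf-8"):
    """Decompress raw if it looks gzipped, then decode it to str."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # charset is an assumption; use the encoding the server declares.
    return raw.decode(charset)


# Works for both a compressed and an uncompressed body.
compressed = gzip.compress("na\u00efve".encode("utf-8"))
print(decode_body(compressed))
print(decode_body(b"plain text"))
```

Checking the `Content-Encoding` response header is the more conventional signal; the magic-byte test is used here only because it needs no network access to demonstrate.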

Original answer from 2010:

Can we get the actual value used for link?

In addition, we usually run into this problem when we call .encode() on a byte string that is already encoded. So you might try decoding it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

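succeeds without error. Note that "windows-1252" is just something I used as an example. I got it from chardet, which reported 0.5 confidence that it is right! (Well, given a string that is 1 character long, what do you expect?) You should change it to whatever encoding actually applies to the byte string returned by .urlopen().read() for the content you retrieved.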

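Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So calling self.response.out.write(html) is pointless, because html is not the encoded string returned by html.encode (if that is what you were originally aiming for).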

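As Ignacio suggested, check the source web page for the actual encoding of the string returned by read(). It is either in one of the meta tags or in the Content-Type header of the response. Then use that as the argument to .decode().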

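Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta charset declarations match the actual content. (Which is a pain, yes; I should know, I used to be one of them.)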
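A rough Python 3 sketch of that advice, using only the standard library: read the charset from a Content-Type header value, try it, and fall back when the declaration turns out to be wrong. The function names and the fallback order (declared charset, then UTF-8, then cp1252 with replacement) are assumptions, not a rule from this answer.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_html(raw, declared=None):
    """Decode raw bytes, distrusting the declared charset if it fails."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown declaration: try the next candidate.
            pass
    # Last resort (an assumption): cp1252, replacing undefined bytes.
    return raw.decode("cp1252", errors="replace")


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
print(declared)
print(repr(decode_html(b"caf\xe9", declared)))
# The page claims UTF-8 but is really single-byte: the fallback kicks in.
print(repr(decode_html(b"caf\xe9", "utf-8")))
```

The fallback hides a wrong declaration rather than fixing it, so logging which encoding finally succeeded is worth considering in real code.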

- Vin-G

1

In your example, I think you meant the last line to be encoded_str = decoded_str.encode("utf8") - Ajith Antony

1

I tried running this in Python 2.7.15 but got the error message "raise IOError, 'Not a gzipped file'". What did I do wrong? - Hyun-geun Kim

25

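I use this helper function in all of my projects. If it can't convert the Unicode, it ignores it. It ties into a Django library, but with a little research you could bypass that.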

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any Unicode errors after using this.

- Gattster

10

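This is suppressing the problem, not diagnosing and fixing it. It's like saying "After I cut off my feet, I no longer had a problem with corns and bunions". - John Machin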

10

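I agree that this only suppresses the problem. It seems, though, that this is exactly what the question asks for. Look at his note: "Can I just drop whatever code bytes are causing the problem instead of getting an error?" - Gattster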

3

This is exactly the same as simply calling "some-string".encode('ascii', 'ignore'). - Joshua Burns

17

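I can't tell you how annoying it is that every time someone asks a question on SO, they get a pile of preachy responses. "My car won't start." "Why do you want to drive a car? You should walk instead." Please stop doing this! - shanusmagnus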

3

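There are plenty of practical business cases, on very real, big-money projects, where dropping these characters is perfectly fine. Mind you, it doesn't change the original meaning. - Yablargo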

11

For broken consoles like cmd.exe and for HTML output you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This preserves all the non-ASCII characters while making them printable in pure ASCII and in HTML.

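WARNING: If you use this in production code to avoid errors, then most likely something is wrong in your code. The only valid use case is printing to a non-Unicode console or easy conversion to HTML entities in an HTML context.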
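To illustrate why nothing is lost with this error handler, here is a small Python 3 sketch: the character references it emits can be turned back into the original text with the standard library's html.unescape. The variable names are made up for the example.

```python
import html

text = "a\u3042\u00e4"  # 'a', HIRAGANA LETTER A, 'a' with diaeresis

# Non-ASCII characters become numeric character references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)

# An HTML-aware consumer can reverse the conversion.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)
```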

And finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable UTF-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').

- ccpizza

7

You wrote: "I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere."

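The HTML is NOT expected to contain any kind of "attempt at Unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front... look for "charset".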

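You appear to be assuming that the charset is UTF-8... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.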

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.
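"Look for charset" can be done on the raw bytes before any decoding. The sketch below (Python 3) is a deliberately naive assumption-laden version: `sniff_meta_charset` is a hypothetical helper, it only scans the first 1024 bytes, and a regular expression is no substitute for a real HTML parser.

```python
import re

# Matches both <meta charset="..."> and content="text/html; charset=...".
_CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared near the top of raw HTML, or None."""
    match = _CHARSET_RE.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
print(sniff_meta_charset(page))
```

If this returns None, or the declared value fails to decode the page, a detector such as chardet is the next thing to try, as the answer says.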

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)

- John Machin

5

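If you have a string line, you can use the .encode([encoding], [errors='strict']) method of strings to convert between encodings.

line = 'my big string'
line.encode('ascii', 'ignore')

For more information about handling ASCII and Unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html.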

- Jama22

1

This does not work when the string contains non-ASCII characters such as ü. - sajid

5

I think the answer is already here, but only in bits and pieces, which makes it hard to quickly fix a problem such as:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file with some data in the following form (containing both ASCII and non-ASCII characters):

1/10/17, 21:36 - Land : Welcome ï¿½ï¿½

and we want to ignore the non-ASCII characters and keep only the ASCII ones.

The following code will do that:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

And type(rline) will give you:

>>> type(rline)
<type 'str'>
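The code above is Python 2 (unicode() and the print statement). A hedged Python 3 equivalent might look like the sketch below; `ascii_lines` is a hypothetical name, it takes any iterable of already-decoded lines, and unlike the original it strips whitespace after the non-ASCII characters are removed.

```python
import unicodedata


def ascii_lines(lines):
    """Return the ASCII-only form of each line, skipping empty results."""
    kept = []
    for line in lines:
        # Split accented letters into base letter + combining mark.
        normalized = unicodedata.normalize("NFKD", line)
        # Drop everything that is not ASCII, then go back to str.
        rline = normalized.encode("ascii", "ignore").decode("ascii").strip()
        if rline:
            kept.append(rline)
    return kept


sample = ["caf\u00e9\n", "\u3042\n", "Welcome \ufffd\ufffd\n"]
print(ascii_lines(sample))
```

With a real file, something like `open(path, encoding="utf-8")` would supply the decoded lines; the encoding named there is an assumption about the file.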

- Somum

This also works for (non-standardized) "extended ASCII" cases. - Oliver Zendel

1

You can use the following code snippet as an example to avoid Unicode-to-ASCII errors:

from anyascii import anyascii

content = "Base Rent for – CC# 2100 Acct# 8410: $41,667.00 – PO – Lines - for Feb to Dec to receive monthly"
content = anyascii(content)
print(content)

- biplabks

- Ignacio Vazquez-Abrams · Accepted Answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

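Decode the string you get back, using either the charset in the appropriate meta tag of the response or the one in the Content-Type header, then encode.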

The encode(encoding, errors) method accepts custom error handlers. Besides ignore, the built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
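As a sketch of what a custom error handler can look like in Python 3: codecs.register_error associates a name with a function that receives the encoding error and returns a replacement string plus the position to resume from. The handler name "codepoint_tag" and the [U+XXXX] output format are made up for this example.

```python
import codecs


def codepoint_tag(error):
    """Replace each unencodable character with a [U+XXXX] marker."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The error may span a run of several unencodable characters.
    bad = error.object[error.start:error.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    # Resume encoding right after the offending run.
    return replacement, error.end


codecs.register_error("codepoint_tag", codepoint_tag)

print("a\u3042\u00e4".encode("ascii", "codepoint_tag"))
```

Unlike ignore, this keeps a visible trace of what was dropped, which can help when diagnosing where the non-ASCII data came from.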