Python UnicodeEncodeError > How can I simply remove troublesome Unicode characters?

Question

6

Here is what I did...

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troublesome Unicode characters from the html? Or is there a cleaner solution?

- Nullpoet

4 Answers

2

The error you see is caused by repr(soup) trying to mix Unicode and byte strings. Mixing Unicode and byte strings frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
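
For readers on Python 3, the same mix fails every time rather than only when the bytes happen to be non-ASCII, because the implicit ASCII conversion is gone. A minimal sketch, assuming Python 3 (the transcripts above are Python 2); the variable names are only illustrative:

```python
# Python 3 sketch: str (text) and bytes never mix implicitly.
text = '1'
raw = '\u00a9'.encode('utf-8')   # the copyright sign as UTF-8 bytes: b'\xc2\xa9'

try:
    text + raw                    # no implicit ASCII decode is attempted
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode explicitly, then concatenate text with text.
combined = text + raw.decode('utf-8')
print(mixed_ok, repr(combined))
```

The fix is the same in both versions: decode bytes to text at the boundary and keep everything as text afterwards.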

Here is an example with classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

Something similar happens in Beautiful Soup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To fix it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'

- jfs

1

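First of all, the "troublesome" Unicode characters could be letters in some language, but assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python? The accepted answer there seems to be a good solution (one I didn't know about before).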
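
The approach in that linked answer boils down to normalizing the text and then encoding to ASCII while ignoring whatever does not fit. A minimal sketch, assuming Python 3 and that the input is already text rather than bytes; `to_ascii` is a hypothetical helper name, not part of any library:

```python
import unicodedata

def to_ascii(text):
    """Drop every character that has no ASCII equivalent (hypothetical helper)."""
    # NFKD splits an accented letter into base letter + combining mark,
    # so the base letter survives and only the mark is discarded.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('caf\u00e9'))         # accented e keeps its base letter
print(repr(to_ascii('\u00ae')))      # the registered sign is dropped entirely
```

Note that this is lossy: characters with no decomposition, such as ® or ©, simply disappear.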

- Karim

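That solution doesn't work for me because the HTML is not Unicode, it is just a str. [>>> unicodedata.normalize('NFKD', html).encode('ascii','ignore') Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: normalize() argument 2 must be unicode, not str] - Nullpoet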

0

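I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display the content: the interpreter tries to convert it to ASCII, and that is what fails. Take a look at the top answer here: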

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
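
If the failure only shows up when the content is displayed, one way around it is to escape non-ASCII characters before printing instead of removing them. A minimal sketch, assuming Python 3; `console_safe` is a hypothetical helper name:

```python
def console_safe(text):
    """Return an ASCII-only rendering of text, escaping everything else."""
    # 'backslashreplace' turns each non-ASCII character into an escape
    # such as \xa9 instead of raising UnicodeEncodeError.
    return text.encode('ascii', 'backslashreplace').decode('ascii')

print(console_safe('<p>\u00a9</p>'))
```

Nothing is lost this way: the escaped form still shows which character was there, which helps when debugging a scraped page.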

- SnowFrogger

Content provided by Stack Overflow.

- esv · Accepted Answer

10

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
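
To see what the 'ignore' handler does to the data, here is a small sketch that leaves BeautifulSoup out and decodes a made-up byte string directly. It assumes Python 3, where the raw page would be a bytes object; the sample bytes are invented for illustration:

```python
raw = b'amazon\xae.com'   # a lone 0xae byte is not valid UTF-8

ignored = raw.decode('utf-8', 'ignore')     # the bad byte is silently dropped
replaced = raw.decode('utf-8', 'replace')   # the bad byte becomes U+FFFD

print(repr(ignored), repr(replaced))
```

'ignore' gives the cleanest-looking result but hides the fact that data was thrown away; 'replace' leaves a visible marker where each undecodable byte was.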

- esv

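That didn't work! Here is what happened... >>> html.decode('utf-8', 'strip') Traceback (most recent call last): ..... LookupError: unknown error handler name 'strip'

html.decode('utf-8') Traceback (most recent call last): ..... UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 98071: unexpected code byte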

- Nullpoet

1

Sorry, it should be 'ignore' rather than 'strip'. Also, I recommend reading the Unicode HOWTO: http://docs.python.org/howto/unicode.html. - esv
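
The failing byte reported in this thread, 0xae, is the registered sign ® in Latin-1, which matches the u'\xae' in the question's traceback. If the page really is Latin-1 or Windows-1252 (an assumption; the thread never states the page's encoding), decoding with that codec keeps the character instead of discarding it. A sketch in Python 3 with made-up sample bytes:

```python
raw = b'amazon\xae.com'   # made-up sample, not the real page

# Strict UTF-8 decoding rejects the lone 0xae byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# 'strip' is not a registered error handler, so decoding the bad
# byte with it raises LookupError.
try:
    raw.decode('utf-8', 'strip')
    handler_error = None
except LookupError as exc:
    handler_error = str(exc)

# Latin-1 maps every byte to a code point, so this cannot fail.
text = raw.decode('latin-1')
print(utf8_ok, handler_error, repr(text))
```

Because Latin-1 never raises, it will also "succeed" on pages in other encodings and produce wrong characters, so it is only appropriate when the encoding is actually known.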