UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

python python-2.7 unicode beautifulsoup urllib2

29

I am learning urllib2 and Beautiful Soup, and on my first tests I am getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

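There seem to be a lot of posts about this type of error, and I have tried the solutions I can understand, but each of them runs into a catch, for example: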

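I want to print post.text (where text is a Beautiful Soup method that returns only the text). Both str(post.text) and post.text produce unicode errors (on characters such as the right apostrophe ’ and the ellipsis …).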
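The failure can be reproduced without Beautiful Soup. A minimal sketch, assuming a made-up sample string (`text` below is hypothetical, not taken from the scraped page) in which the ellipsis sits at index 10, as in the error message above:

```python
# Hypothetical sample text; U+2026 (horizontal ellipsis) is at index 10.
text = u"And so on \u2026"

try:
    # This is the conversion Python 2 attempts implicitly in str(text).
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; Python 3 clears exc after the block
    print(exc)
```

The exception object carries the codec name and the offending index, which helps locate the character in a longer string.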

So I add post = unicode(post) above str(post.text), and then I get:

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

Then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

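It would be nice if I could declare something like 'make this content interpretable' at the top of the document and still have access to the required text function.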

Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get:

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'
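The AttributeErrors above share a cause: each assignment rebinds `post` to a plain string, which no longer has the parsed element's attributes. A sketch with a hypothetical stand-in class (`FakePost` is not a Beautiful Soup type), converting the text rather than the object:

```python
class FakePost(object):
    """Hypothetical stand-in for a parsed element exposing a .text attribute."""
    text = u"It\u2019s done\u2026"

post = FakePost()

# Turning the whole object into a string discards its attributes.
as_string = repr(post)
has_text = hasattr(as_string, "text")  # strings have no .text

# Converting only the text leaves post itself usable afterwards.
encoded = post.text.encode("utf-8")
```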

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

To try anything that might work, I installed lxml for Windows from here and used it like this:

parsed_content = BeautifulSoup(original_content, "lxml")

This follows http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters. These steps do not seem to have made a difference. I am using Python 2.7.4 and Beautiful Soup 4.

Solution:

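After getting a deeper understanding of unicode, utf-8 and the Beautiful Soup types, it turned out to be related to how I was printing. I removed all of my str calls and concatenations, e.g. str(something) + post.text + str(something_else), and changed them to something, post.text, something_else. This seems to print nicely, except that I have less control over the formatting at this stage (e.g. a space is inserted at each ,).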
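For more control over spacing than comma-separated print arguments give, one option is to build the whole line as a single unicode string and encode it once at the end. A sketch, assuming hypothetical values for `label`, `count` and `post_text`:

```python
label = u"Post"
count = 3
post_text = u"It\u2019s done\u2026"  # stand-in for post.text

# Format into one unicode template instead of concatenating str() results.
line = u"{0} #{1}: {2}".format(label, count, post_text)

# Encode exactly once, at the output boundary.
payload = line.encode("utf-8")
```

Because every piece stays unicode until the final encode, no implicit ASCII conversion takes place along the way.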

- user1063287

Possible duplicate of: UnicodeEncodeError: 'ascii' codec can't encode character. - R. Martinho Fernandes

3 Answers

2

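    import urllib.request
    from bs4 import BeautifulSoup
    # Python 3 code; THE_URL is a placeholder for the page address.
    html = urllib.request.urlopen(THE_URL).read()
    soup = BeautifulSoup(html)
    print("'" + str(soup.encode("ascii")) + "'")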

Works for me ;-)
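Encoding to ASCII only succeeds if something is done about the characters ASCII lacks. Python's `encode` takes an `errors` argument for this; the sketch below compares the standard handlers on a hypothetical sample string:

```python
sample = u"wait\u2026"  # hypothetical text ending in an ellipsis

replaced = sample.encode("ascii", "replace")               # lossy: '?' placeholder
ignored = sample.encode("ascii", "ignore")                 # lossy: character dropped
as_charref = sample.encode("ascii", "xmlcharrefreplace")   # numeric character reference
escaped = sample.encode("ascii", "backslashreplace")       # Python-style escape

for result in (replaced, ignored, as_charref, escaped):
    print(repr(result))
```

Only the last two handlers keep enough information to recover the original character.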

- Patpog

0

Have you tried .decode() or .decode("utf-8")?

Also, I suggest using lxml together with the html5lib parser.

http://lxml.de/html5parser.html

- jeyraof

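I tried these and added the results to the original post. I have only just learned the basics of Beautiful Soup and urllib2, which took me about two weeks. Do I really need to learn two more programs? lxml looks very difficult to me, which is why I chose Beautiful Soup: I can understand it more easily. Again, I am only trying to get "simple" English text, but it trips over common elements like the proper apostrophe ’ and the ellipsis …. - user1063287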
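If plain ASCII output is genuinely all that is needed, one possible approach (not suggested in the thread itself) is to map the typographic characters to ASCII look-alikes before printing. A sketch, where the mapping table and the sample string are assumptions:

```python
# Map code points to ASCII stand-ins; extend the table as needed.
ASCII_EQUIVALENTS = {
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark (typographic apostrophe)
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2026: u"...",  # horizontal ellipsis
}

sample = u"It\u2019s \u201Cdone\u201D\u2026"  # hypothetical scraped text
plain = sample.translate(ASCII_EQUIVALENTS)
ascii_bytes = plain.encode("ascii")  # succeeds: every character is now ASCII
```

This changes the text, so it only suits cases where the exact punctuation does not matter.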


- icktoofay · Accepted Answer

46

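In Python 2, unicode objects can only be printed if they can be converted to ASCII. If one cannot be encoded as ASCII, you get that error. You probably want to encode it explicitly and then print the resulting str: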

print post.text.encode('utf-8')
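What the explicit encode produces can be inspected directly. A sketch using a hypothetical stand-in for post.text: the result is a byte string in which the ellipsis occupies three bytes, the first of them 0xe2.

```python
post_text = u"done\u2026"  # hypothetical stand-in for post.text

encoded = post_text.encode("utf-8")
print(repr(encoded))

# Decoding with the same codec recovers the original text.
round_trip = encoded.decode("utf-8")
```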

- icktoofay

1

+ '\n\n' + post.text.encode("utf-8") + '\n\n' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

- user1063287

1

@user1063287: encode shouldn't raise a UnicodeDecodeError. What is your traceback? - icktoofay

1

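@user1063287: I guess what I am trying to say is that I need more context. I know that post.text.encode('utf-8') on its own should be fine; something else is trying to decode it, and you have not shown the code that does the decoding. If you could edit your question to include some context on where it is being used, that would be helpful. - icktoofay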

2

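Basically, Python 2 has this weird thing with str and unicode. If you concatenate them, it implicitly encodes or decodes as ASCII (I forget which) to make them the same type. Of course, you cannot do that when non-ASCII characters are involved: you have to explicitly make sure everything is the same type. Python 3 fixes this by raising an error when you mix them, rather than behaving in a way that sometimes works and sometimes does not. - icktoofay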
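For the record, when a str and a unicode are concatenated in Python 2, the byte string is implicitly decoded as ASCII, which is where a UnicodeDecodeError on byte 0xe2 can come from. The sketch below performs that decode explicitly so it behaves the same on Python 2 and Python 3; the sample values are assumptions:

```python
prefix = u"\n\n"                          # a unicode object
encoded = u"done\u2026".encode("utf-8")   # a byte string (a Python 2 str)

try:
    # Python 2 does the equivalent of this when evaluating prefix + encoded.
    combined = prefix + encoded.decode("ascii")
except UnicodeDecodeError as exc:
    # Record the failing index and the byte found there.
    failure = (exc.start, exc.object[exc.start:exc.start + 1])
    print(exc)

# Keeping everything unicode and encoding once avoids the implicit step.
safe = (prefix + u"done\u2026" + prefix).encode("utf-8")
```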

1

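"In Python 2, unicode objects can only be printed if they can be converted to ASCII." This is not correct. Python detects the locale at startup and configures stdout and stderr to automatically encode Unicode written to those file objects. This means that printing non-ASCII Unicode works fine on properly configured consoles and terminals. - Martijn Pieters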
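The encoding Python picked for standard output can be checked, and the difference between an ASCII-only stream and a UTF-8 one can be simulated with in-memory streams. A sketch using only the standard io module; the in-memory buffers stand in for a console, and `write_text` is a hypothetical helper:

```python
import io
import sys

# The encoding detected for the real stdout (may be None when output is piped).
print(getattr(sys.stdout, "encoding", None))

def write_text(text, encoding):
    """Write text through a text stream that encodes with the given codec."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

utf8_output = write_text(u"done\u2026", "utf-8")

try:
    write_text(u"done\u2026", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The same text goes through cleanly when the stream's encoding can represent it and fails when the stream is limited to ASCII.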
