BeautifulSoup Chinese character encoding error

Question

python python-2.7 unicode encoding beautifulsoup

4

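I am trying to identify and save all of the headings on a specific site, but I keep running into what I believe are encoding errors.

The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm

The current code is: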

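import urllib
from bs4 import BeautifulSoup

holder = {}

# Note: despite its name, `url` holds the downloaded HTML, not the address.
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print head1
holder["key"] = head1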

The output of the print is:

[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]

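I believe those are Unicode characters, but I haven't been able to figure out how to get Python to display them as characters.

I have already tried looking for answers elsewhere. The most explicit one is this question: "Python and BeautifulSoup encoding issues", which suggests adding: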

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))

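However, this gives me the same error mentioned in a comment there ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'"). Removing the second '.BeautifulSoup' results in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").

I also tried the answer suggested here: "Chinese character encoding error with BeautifulSoup in Python?", which breaks the creation of the object into separate steps: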

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

But this also produces the recursion error. Any other tips would be greatly appreciated. Thanks.

- user5356756

I ran into the same problem, tried this, and it worked: https://stackoverflow.com/a/65354890/20294353 - semui

2 Answers

0

This may be a fairly simple solution. I'm not sure whether it does everything you need, so let me know:

import urllib
from bs4 import BeautifulSoup

holder = {}

url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
head1 = soup.find_all(['h1','h2','h3'])
print unicode(head1)
holder["key"] = head1

Reference: Python 2.7 Unicode

- Josh Rumbut

Thanks! Unfortunately, this gives me exactly the same output as before, so I still get \u1234-style escapes instead of characters. - user5356756

- Padraic Cunningham · Accepted Answer

Decode using unicode-escape:

In [6]: from bs4 import BeautifulSoup

In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""

In [8]: soup = BeautifulSoup(h, 'lxml')

In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
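
The session above is Python 2, where a string still has a .decode() method. As a minimal sketch of the same idea on Python 3 (an assumption, since the thread itself targets Python 2.7), the escaped text can be round-tripped through ASCII bytes first. The variable names are illustrative only:

```python
# Python 3 sketch: str has no .decode(), so go through bytes first.
# `escaped` holds literal backslash-u sequences, as in the question's output.
escaped = r"\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316"

# The escaped text is pure ASCII, so encoding it to bytes loses nothing;
# the unicode_escape codec then turns each \uXXXX sequence into a character.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(decoded)
```

This only works when the input is pure ASCII. If the text already contains real non-ASCII characters, .encode("ascii") raises UnicodeEncodeError, which is a hint that no unescaping is needed in the first place.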

If you look at the page source, you can see that the data is UTF-8 encoded:

<meta http-equiv="content-language" content="utf-8" />

For me, using bs4 4.4.1, simply decoding what urllib returns is enough:

In [1]: from bs4 import BeautifulSoup

In [2]: import urllib

In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')

In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
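
Note that this session prints the text of a single element, whereas the question prints the whole list. A likely reason the question shows escapes is that Python 2 formats a list by calling repr() on each element, and repr() escapes non-ASCII characters. The sketch below illustrates the difference on Python 3 (an assumption), using ascii() to mimic that escaping. The plain strings in `headings` are stand-ins for the .text values of the tags returned by find_all:

```python
# Stand-ins for [tag.text for tag in head1]; written with escapes so the
# example does not depend on the source file's encoding.
headings = [
    "\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316",
    "\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d",
]

# ascii() escapes non-ASCII characters, much like Python 2's repr() does
# when a whole list is printed at once.
as_container = ascii(headings)
print(as_container)

# Printing each element on its own shows the real characters.
joined = "\n".join(headings)
print(joined)
```

With the question's code, the equivalent would be to loop over head1 and print each tag's text instead of printing head1 itself.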

If you are going to write to a CSV file, you will need to encode the data as a UTF-8 string:

 .decode("unicode-escape").encode("utf-8")

You can do the encoding when you save the data in your dict.
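
As a rough sketch of the CSV step on Python 3 (an assumption; the answer itself is written for Python 2.7), the encoding can be left to the file object rather than done by hand. The file name headings.csv, the temporary directory, and the contents of `holder` are hypothetical and exist only to make the example self-contained:

```python
import csv
import os
import tempfile

# Hypothetical data: heading texts already extracted as str objects.
holder = {
    "key": [
        "\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316",
        "\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d",
    ]
}

# Write into a temporary directory so the example leaves the working dir alone.
path = os.path.join(tempfile.mkdtemp(), "headings.csv")

# encoding="utf-8" makes the file object encode each row on write;
# newline="" is what the csv module documentation asks for.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    for key, headings in holder.items():
        for heading in headings:
            writer.writerow([key, heading])

# Read the raw bytes back to confirm they are valid UTF-8.
with open(path, "rb") as f:
    raw = f.read()
lines = raw.decode("utf-8").splitlines()
```

On Python 2.7, as used in the question, the csv module works with byte strings, which is why the answer encodes each value with .encode("utf-8") before writing.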