urllib2 read to Unicode

Question

46

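I need to store the content of a site that can be in any language, and I need to be able to search that content for a Unicode string.

I have tried something like: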

import urllib2

req = urllib2.urlopen('http://lenta.ru')
content = req.read()

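The content is a byte stream, so I cannot reliably search it for a Unicode string until it has been decoded.

I need some way, when I call urlopen and then read, to use the charset from the headers to decode the content and then encode it as UTF-8.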

- Vitaly Babiy

Encoding is done with a function from the urllib library, not from urllib2. See http://www.voidspace.org.uk/python/articles/urllib2.shtml#headers for details. - Macarse

2

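@Macarse That is not the kind of encoding Vitaly means. He is talking about decoding and encoding the content returned by the request, using '[byte string]'.decode('[charset]') and u'[unicode string]'.encode('utf-8'). What you are referring to is encoding the request parameters. - Remco Wendt

Related: A good way to get the charset/encoding of an HTTP response in Python - jfs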

2 Answers

10

To parse the Content-Type HTTP header, you can use the cgi.parse_header function:

import cgi
import urllib2

r = urllib2.urlopen('http://lenta.ru')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset', 'utf-8')
unicode_text = r.read().decode(encoding)
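
The snippet above is Python 2. On Python 3 the cgi module was removed in 3.13, so the following is a minimal sketch of one possible stand-in, assuming the standard-library email.message.Message class is an acceptable way to parse the header. The function name charset_from_content_type and the 'utf-8' default are illustrative choices, not part of the original answer.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` when the header
    carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes and lower-cases the parameter value.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=windows-1251"))
print(charset_from_content_type("text/html"))
```

The returned name can then be passed to bytes.decode(), just as encoding is used in the Python 2 code above.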

Another way to get the charset:

>>> import urllib2
>>> r = urllib2.urlopen('http://lenta.ru')
>>> r.headers.getparam('charset')
'utf-8'

Or in Python 3:

>>> import urllib.request
>>> r = urllib.request.urlopen('http://lenta.ru')
>>> r.headers.get_content_charset()
'utf-8'
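
To go from the charset to decoded text in Python 3, something along these lines may work. It is a sketch, not a definitive implementation: read_text and FakeResponse are hypothetical names, and FakeResponse only stands in for the object returned by urllib.request.urlopen so that the example runs without network access.

```python
import io
from email.message import Message


def read_text(response, default="utf-8"):
    """Decode a response body using the charset from its headers."""
    # get_content_charset() returns None when no charset is declared.
    charset = response.headers.get_content_charset() or default
    # errors="replace" keeps going on malformed bytes instead of raising.
    return response.read().decode(charset, errors="replace")


class FakeResponse:
    """Hypothetical stand-in for the object returned by urlopen()."""

    def __init__(self, body, content_type):
        self.headers = Message()
        self.headers["Content-Type"] = content_type
        self._body = io.BytesIO(body)

    def read(self):
        return self._body.read()


body = "Главное".encode("windows-1251")
response = FakeResponse(body, "text/html; charset=windows-1251")
text = read_text(response)
print(text == "Главное")
```

With a real response, read_text(urllib.request.urlopen(url)) would be the equivalent call. A charset name that Python does not recognize would still raise LookupError, which this sketch does not handle.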

The character encoding can also be specified inside the HTML document itself, e.g. <meta charset="utf-8">.
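
When the header carries no charset, the declaration inside the document can be looked for in the raw bytes. The following Python 3 sketch uses a deliberately simple regular expression and is an assumption-laden illustration, not a full implementation of the HTML encoding-sniffing rules. The names META_CHARSET_RE and sniff_meta_charset, and the 1024-byte scan window, are all hypothetical choices.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form, where
# charset=... appears inside the content attribute.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, scan_bytes=1024):
    """Return the charset declared in a <meta> tag, or None if absent."""
    match = META_CHARSET_RE.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(sniff_meta_charset(b'<html><head><meta charset="utf-8"></head>'))
```

A dedicated HTML parser is likely the more robust choice for real pages; the regular expression only covers the common declaration forms.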

- jfs

- Alex Martelli · Accepted Answer

After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

So:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters), so, for example, if your terminal is UTF-8 you can display part of it:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

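And you can search it, and so on.

Edit: Unicode I/O is usually tricky (this may be what is holding up the original asker). I am going to bypass the difficult problem of typing Unicode strings into an interactive Python interpreter, which is completely unrelated to the original question, and show that once a Unicode string is correctly entered (I am doing it by code points, which is goofy but not hard), searching is a no-brainer. Hopefully that answers the original question thoroughly. Again assuming a UTF-8 terminal: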

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
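
For readers on Python 3, where str is already Unicode, the same decode, search and re-encode steps might look like the sketch below. The sample bytes are an assumption made for illustration: they reproduce only the title fragment shown earlier, encoded as windows-1251, rather than the full page.

```python
# Sample bytes standing in for the downloaded page (assumption: only the
# <title> fragment from the answer, encoded as windows-1251).
raw = "<title>Lenta.ru: Главное: </title>".encode("windows-1251")

# Decode with the charset reported by the server.
text = raw.decode("windows-1251")

# The search term, written by code points as in the answer.
needle = "\u0413\u043b\u0430\u0432\u043d\u043e\u0435"

found = needle in text
position = text.find(needle)

# Searching the undecoded bytes with a UTF-8 needle misses the match,
# because the page bytes are not UTF-8.
found_in_raw = needle.encode("utf-8") in raw

# Re-encode as UTF-8 for storage, as the question asks.
utf8_bytes = text.encode("utf-8")

print(found, position, found_in_raw)  # True 17 False
```

The position is 17 here rather than 93 because the sample contains only the title fragment, which starts at offset 76 in the full page.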

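Note: Keep in mind that this approach may not work for every site, since some sites specify the character encoding only inside the served document (for example, in an http-equiv meta tag).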