A good way to get the charset/encoding of an HTTP response in Python

Question

python character-encoding httprequest urllib2

31

I am looking for an easy way to get the charset/encoding information of an HTTP response using Python urllib2 or any other Python library.

>>> url = 'http://some.url.value'
>>> request = urllib2.Request(url)
>>> conn = urllib2.urlopen(request)
>>> response_encoding = ?

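I know that it is sometimes present in the 'Content-Type' header, but that header carries other information as well, and the charset is embedded in a string that I would need to parse. For example, the Content-Type header returned by Google is: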

>>> conn.headers.getheader('content-type')
'text/html; charset=utf-8'

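I could work with that, but I am not sure how consistent the format will be. I am pretty sure the charset can be missing entirely, so I would have to handle that edge case. Some kind of string split operation to get the 'utf-8' out of it seems like the wrong way to do this.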

>>> content_type_header = conn.headers.getheader('content-type')
>>> if '=' in content_type_header:
...     charset = content_type_header.split('=')[1]

This kind of code feels like it is doing too much work, and I am not sure it works in every case. Is there a better way to do this?

- Clay Wardell

6 Answers

7

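If you happen to be familiar with the Flask/Werkzeug web development stack, you will be happy to know that the Werkzeug library has an answer for exactly this kind of HTTP header parsing, and it accounts for the case where the content type is not specified at all, just as you wanted.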

 >>> from werkzeug.http import parse_options_header
 >>> import requests
 >>> url = 'http://some.url.value'
 >>> resp = requests.get(url)
 >>> if resp.status_code == requests.codes.ok:
 ...     content_type_header = resp.headers.get('content-type')
 ...     print(content_type_header)
 ...
 text/html; charset=utf-8
 >>> parse_options_header(content_type_header)
 ('text/html', {'charset': 'utf-8'})

Then you can do:

 >>> parse_options_header(content_type_header)[1].get('charset')
 'utf-8'

Note that if charset is not supplied, it produces this instead:

 >>> parse_options_header('text/html')
 ('text/html', {})

It even works if you supply nothing but an empty string or dict:

 >>> parse_options_header({})
 ('', {})
 >>> parse_options_header('')
 ('', {})

So it seems to be exactly what you were looking for! If you look at the source code, you will see they had your use case in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329

def parse_options_header(value):
    """Parse a ``Content-Type`` like header into a tuple with the content
    type and the options:
    >>> parse_options_header('text/html; charset=utf8')
    ('text/html', {'charset': 'utf8'})
    This should not be used to parse ``Cache-Control`` like headers that use
    a slightly different format.  For these headers use the
    :func:`parse_dict_header` function.
    ...

Hope this helps someone someday! :)

- Brian Peterson

5

The requests library makes this easy:

>>> import requests
>>> r = requests.get('http://some.url.value')
>>> r.encoding
'utf-8' # e.g.

- dnozay

3

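Except that the encoding detection in requests is incorrect (it does not take meta tags into account), and they are unwilling to fix it (https://github.com/kennethreitz/requests/issues/1087). - Mikhail Korobov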

1

See my answer here: https://dev59.com/XFcP5IYBdhLWcg3wkKtD#52615216. You can use requests.Response.apparent_encoding directly. - bubak

3

The charset can be specified in several ways, but it is usually done in the headers.

>>> from urllib.request import urlopen  # Python 3
>>> urlopen('http://www.python.org/').info().get_content_charset()
'utf-8'
>>> urlopen('http://www.google.com/').info().get_content_charset()
'iso-8859-1'
>>> urlopen('http://www.python.com/').info().get_content_charset()
>>>

The last example does not specify a charset, so get_content_charset() returned None.
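To try the None case without a network connection, here is a minimal sketch. It relies on the fact that in Python 3 the object returned by info() is a subclass of email.message.Message, so a plain Message behaves the same way for this call. The helper name charset_from_header is hypothetical and only used for illustration.

```python
from email.message import Message


def charset_from_header(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`."""
    headers = Message()
    if content_type is not None:
        headers["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns the
    # fallback argument when no charset parameter is present.
    return headers.get_content_charset(default)


print(charset_from_header("text/html; charset=UTF-8"))
print(charset_from_header("text/html"))
print(charset_from_header("text/html", "iso-8859-1"))
```

Passing a default avoids a separate None check at the call site, at the cost of hiding whether the server declared anything at all.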

- Cees Timmerman

1

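It only looks at the HTTP headers, which may be misleading. The <meta charset=..> in the HTML document is more likely to be under the control of whoever created the document than the server's headers are. Also, there is no get_content_charset() in Python 2. cgi.parse_header() works the same on both Python 2 and 3. - jfs

This works great in Python 3 as an initial check of the charset from the headers. You can check this first and, if it is empty, run a BeautifulSoup check on the content itself. - james-see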

2

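To decode HTML properly (i.e. in a browser-like way; we can't do better), you need to take the following into account:

the Content-Type HTTP header value;
BOM marks;
<meta> tags in the page body;
differences between the encoding names used on the web and the encoding names available in the Python stdlib;
as a last resort, if everything else fails, guessing based on statistics.

All of the above is implemented in the w3lib.encoding.html_to_unicode function: it has the signature html_to_unicode(content_type_header, html_body_str, default_encoding='utf8', auto_detect_fun=None) and returns a (detected_encoding, unicode_html_content) tuple.

requests, BeautifulSoup, UnicodeDammit, chardet and Flask's parse_options_header are not correct solutions, as each of them fails on at least one of these points.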
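To make the checks in the list concrete, here is a rough standard-library sketch. It is not w3lib's implementation. The function names, the precedence (BOM, then header, then <meta>), the three-entry alias table and the 4096-byte scan window are all assumptions of this sketch, and the statistical guess is left out in favour of a fixed default.

```python
import codecs
import re
from email.message import Message

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Illustrative subset: labels that browsers treat as windows-1252.
WEB_ALIASES = {"iso-8859-1": "cp1252", "latin1": "cp1252", "ascii": "cp1252"}

META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w.:-]+)", re.IGNORECASE
)


def to_python_codec(label):
    """Map a web encoding label to a Python codec name, or None if unknown."""
    if not label:
        return None
    label = label.strip().lower()
    label = WEB_ALIASES.get(label, label)
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


def guess_html_encoding(content_type, body, default="utf-8"):
    """Hypothetical helper: choose an encoding for raw HTML bytes."""
    # 1. Byte order mark at the start of the body.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 2. charset parameter of the Content-Type header.
    if content_type:
        headers = Message()
        headers["Content-Type"] = content_type
        codec = to_python_codec(headers.get_content_charset())
        if codec:
            return codec
    # 3. <meta charset=...> or http-equiv declaration near the top of the page.
    match = META_CHARSET.search(body[:4096])
    if match:
        codec = to_python_codec(match.group(1).decode("ascii", "replace"))
        if codec:
            return codec
    # 4. A statistical guess would go here; this sketch returns the default.
    return default


print(guess_html_encoding("text/html", b'<meta charset="koi8-r">'))
```

An unknown label in the header (one that Python has no codec for) is skipped rather than raised, so the later checks still get a chance.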

- Mikhail Korobov

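I was looking for a solution that simply scans the bytes and retrieves the encoding from the meta tags. Really nice! - evg656e

Thanks for pointing out the w3lib library. It fits my use case very well, in particular w3lib.encoding.html_to_unicode. - Musab Gultekin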


0

This works flawlessly for me.

I am using Python 2.7 and 3.4.

print (text.encode('cp850','replace'))
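The line above does not detect an encoding. It re-encodes already decoded text for a cp850 console and substitutes '?' for anything that code page cannot represent. A small illustration, assuming Python 3 and a made-up sample string:

```python
text = "Sacré bleu ☃"  # sample text; the snowman has no cp850 equivalent
encoded = text.encode("cp850", "replace")
print(encoded)
```

Here é survives as a single cp850 byte, while the snowman is replaced with '?'.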

- Usama Tahir


- jfs · Accepted Answer

To parse an HTTP header you can use cgi.parse_header():

import cgi  # deprecated in Python 3.11, removed in 3.13

_, params = cgi.parse_header('text/html; charset=utf-8')
print(params['charset']) # -> utf-8

Or using the response object:

import urllib2  # Python 2 only

response = urllib2.urlopen('http://example.com')
response_encoding = response.headers.getparam('charset')
# or in Python 3: response.headers.get_content_charset(default)
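On Python versions where the cgi module is no longer available (it was removed in 3.13), the email package can do the same parsing. This is a minimal sketch of that approach. The function name parse_content_type is hypothetical and not part of any library.

```python
from email.message import EmailMessage


def parse_content_type(value):
    """Split a Content-Type value into (media_type, params_dict)."""
    msg = EmailMessage()
    msg["Content-Type"] = value
    # With the default policy the header object exposes its parsed
    # parameters as a read-only mapping; copy it into a plain dict.
    return msg.get_content_type(), dict(msg["Content-Type"].params)


media_type, params = parse_content_type("text/html; charset=utf-8")
print(media_type, params.get("charset"))
```

Using params.get("charset") rather than params["charset"] keeps the missing-charset case from raising KeyError.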

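In general, the server may lie about the encoding or not report it at all (the default depends on the content type), or the encoding may be specified inside the response body, e.g. in a <meta> element in HTML documents or in the XML declaration for XML documents. As a last resort, the encoding can be guessed from the content itself. You can use requests to get Unicode text:

import requests # pip install requests

url = 'http://example.com'  # placeholder URL
r = requests.get(url)
unicode_str = r.text # may use `chardet` to auto-detect encoding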

Or BeautifulSoup to parse the HTML (and convert it to Unicode as a side effect):

from bs4 import BeautifulSoup # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen(url)) # may use `cchardet` for speed
# ...

Or bs4.UnicodeDammit directly for arbitrary content (not necessarily HTML):

from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# -> Sacré bleu!
print(dammit.original_encoding)
# -> utf-8