How do I fix an encoding problem in Python Mechanize?

5

Here is the sample code:

from mechanize import Browser

br = Browser()
page = br.open('http://hunters.tclans.ru/news.php?readmore=2')
br.form = br.forms().next()
print br.form

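The problem is that the server reports an invalid encoding name (windows-cp1251). How can I manually set the encoding of the current page in mechanize?
Error: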
Traceback (most recent call last):
  File "/tmp/stackoverflow.py", line 5, in <module>
    br.form = br.forms().next()
  File "/usr/local/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 426, in forms
    return self._factory.forms()
  File "/usr/local/lib/python2.6/dist-packages/mechanize/_html.py", line 559, in forms
    self._forms_factory.forms())
  File "/usr/local/lib/python2.6/dist-packages/mechanize/_html.py", line 225, in forms
    _urlunparse=_rfc3986.urlunsplit,
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 967, in ParseResponseEx
    _urlunparse=_urlunparse,
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 1104, in _ParseFileEx
    fp.feed(data)
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 870, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python2.6/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/sgmllib.py", line 193, in goahead
    self.handle_entityref(name)
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 751, in handle_entityref
    '&%s;' % name, self._entitydefs, self._encoding))
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 238, in unescape
    return re.sub(r"&#?[A-Za-z0-9]+?;", replace_entities, data)
  File "/usr/lib/python2.6/re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "/usr/local/lib/python2.6/dist-packages/ClientForm.py", line 230, in replace_entities
    repl = repl.encode(encoding)
LookupError: unknown encoding: windows-cp1251
2 Answers

3

I don't know Mechanize, but you could patch codecs to accept malformed encoding names that contain both "windows" and "cp":

>>> import codecs
>>> def fixcp(name):
...     # 'windows-cp1251' -> 'windows-1251': drop the stray 'cp'
...     if name.lower().startswith('windows-cp'):
...         try:
...             return codecs.lookup(name[:8]+name[10:])
...         except LookupError:
...             pass
...     return None
... 
>>> codecs.register(fixcp)
>>> '\xcd\xe0\xef\xee\xec\xe8\xed\xe0\xe5\xec'.decode('windows-cp1251')
u'Напоминаем'
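
The session above is Python 2. As a rough sketch of the same idea that should also run on Python 3, the search function below normalizes the name first, because newer Python versions may pass the name to search functions with hyphens converted to underscores. The function name fix_windows_cp is made up for this illustration, and mapping 'windows-cpNNNN' to 'cpNNNN' is an assumption about what the server meant.

```python
import codecs

def fix_windows_cp(name):
    """Codec search function: map 'windows-cpNNNN' to the real 'cpNNNN' codec."""
    # Depending on the Python version, the name may arrive with hyphens kept
    # or replaced by underscores, so normalize both forms before matching.
    key = name.lower().replace('-', '_')
    prefix = 'windows_cp'
    digits = key[len(prefix):]
    if key.startswith(prefix) and digits.isdigit():
        try:
            return codecs.lookup('cp' + digits)
        except LookupError:
            pass
    # Returning None tells the codec registry to keep looking elsewhere.
    return None

codecs.register(fix_windows_cp)

# Same bytes as in the session above, written as a bytes literal.
raw = b'\xcd\xe0\xef\xee\xec\xe8\xed\xe0\xe5\xec'
text = raw.decode('windows-cp1251')
```

Once the search function is registered, any library in the same process that looks up 'windows-cp1251' should get the cp1251 codec, which is why this workaround does not need access to mechanize internals.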

Getting the right value is not the problem. The problem now is how to reach the _AbstractFormParser inside mechanize. - Fluffy

2

Fixed by setting

br._factory.encoding = enc
br._factory._forms_factory.encoding = enc
br._factory._links_factory._encoding = enc

after br.open() (note the underscore in _encoding on the last line).
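
The three assignments can be wrapped in a small helper, sketched here under some assumptions. The name force_encoding is hypothetical, the private attributes are taken from this answer and may differ between mechanize versions, and 'windows-1251' is only a guess at a suitable enc, based on the malformed name in the traceback.

```python
import codecs

def force_encoding(br, enc):
    """Apply the three assignments from the answer to a Browser-like object."""
    # Fail early with LookupError if enc is itself a name Python does not know.
    codecs.lookup(enc)
    factory = br._factory
    factory.encoding = enc
    factory._forms_factory.encoding = enc
    # Note the leading underscore on this attribute only.
    factory._links_factory._encoding = enc
    return br
```

A call such as force_encoding(br, 'windows-1251') would go right after br.open(...) and before the first br.forms() call.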

Could you explain what the value of enc is in this example? Thanks. - abu
