Python urllib open problem

Question

4

I'm trying to fetch data from http://book.libertorrent.com/, but it currently fails because the response contains some extra data (the headers). My code is very simple:

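import urllib  # Python 2; in Python 3 the equivalent is urllib.request.urlopen

response = urllib.urlopen('http://book.libertorrent.com/login.php')
with open('someFile.html', 'w') as f:  # the file is closed on leaving the block
    f.write(response.read())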

read() returns:

Date: Fri, 09 Nov 2012 07:36:54 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Cache-Control: no-cache, pre-check=0, post-check=0
Expires: 0
Pragma: no-cache
Set-Cookie: bb_test=973132321; path=/; domain=book.libertorrent.com
Content-Language: ru

1ec0
...Html...
0

Meanwhile, response.info() returns nothing.

Is there any way to correct the response?

- maravan

1

What does response.getcode() return after response.read()? On my Mac, response.read() returns the HTML and .getcode() returns 200, which indicates success. - Hai Vu

1

Your approach should normally work; I get the same problem as you with that site... - Matthew Adams

1

Same here. Interestingly, it works in Python 3. - poke

1 Answer

- mata · Accepted Answer

Let's try this:

$ echo -ne "GET /index.php HTTP/1.1\r\nHost: book.libertorrent.com\r\n\r\n" | nc book.libertorrent.com 80 | head -n 10
HTTP/1.1 200 OK
WWW
Date: Sat, 10 Nov 2012 17:41:57 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Language: ru

1f57
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html dir="ltr">

See the "WWW" on the second line? That is not a valid HTTP header, and I guess it is what trips up the response parser.
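To make the point concrete, here is a minimal sketch (Python 3) that scans the header section of a raw response for lines that are not of the form "name: value". The function name find_invalid_header_lines and the sample bytes are made up for illustration; the sample simply mirrors the nc output above. The check is deliberately simple and would also flag obsolete folded continuation lines.

```python
import re

# A header line is a field name made of RFC 7230 "token" characters,
# followed by a colon and the value.
_FIELD_LINE = re.compile(rb"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+:.*$")


def find_invalid_header_lines(head):
    """Return (line_number, line) pairs for lines that are not 'name: value'.

    `head` is the raw bytes of the response up to the blank line,
    status line included.
    """
    lines = head.split(b"\r\n")
    bad = []
    # lines[0] is the status line, so header lines start at line 2.
    for number, line in enumerate(lines[1:], start=2):
        if line and not _FIELD_LINE.match(line):
            bad.append((number, line))
    return bad


# Hypothetical sample modelled on the output shown above.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"WWW\r\n"
    b"Date: Sat, 10 Nov 2012 17:41:57 GMT\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Transfer-Encoding: chunked\r\n"
)
print(find_invalid_header_lines(sample))  # [(2, b'WWW')]
```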

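By the way, Python 2 and Python 3 behave differently here:

Python 2 seems to treat everything after the invalid header as content.
Python 3 ignores all of the headers and continues reading the content after the double newline. Because the headers are ignored, the transfer encoding is ignored too, so the chunk-size lines are interpreted as part of the body.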
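The stray "1ec0", "1f57" and "0" lines in the output are chunk sizes from the chunked transfer encoding, written in hexadecimal. The following is a rough sketch (Python 3) of how such a body is decoded when the Transfer-Encoding header is honoured. decode_chunked is a hypothetical helper, not part of urllib, and it ignores trailers and malformed input.

```python
def decode_chunked(body):
    """Decode an HTTP/1.1 chunked body (bytes) into the plain payload."""
    payload = b""
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal; drop any ";extension" suffix.
        size = int(body[pos:line_end].split(b";")[0].strip(), 16)
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        start = line_end + 2
        payload += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return payload


raw = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(raw))  # b'Hello, world'
print(int("1f57", 16))      # 8023
```

A parser that skips this step hands back the size lines together with the HTML, which matches what the question shows.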

So in the end, the problem is that the server sends an invalid response, which should be fixed on the server side.