我已经爬取了很多网站,经常会想知道为什么在Firebug中显示的响应头和urllib.urlopen(url).info()
返回的响应头之间经常不同,因为Firebug报告了更多的标头。
今天我遇到了一个有趣的问题。我通过跟随“搜索URL”来爬取网站,该网站在重定向到最终页面之前完全加载(返回200状态代码)。执行爬取的最简单方法是返回Location
响应标头并进行另一次请求。然而,当我运行'urllib.urlopen(url).info()时,特定的标头是不存在的。
以下是区别:
Firebug标头:
Cache-Control : no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Connection : keep-alive
Content-Encoding : gzip
Content-Length : 2433
Content-Type : text/html
Date : Fri, 05 Oct 2012 15:59:31 GMT
Expires : Thu, 19 Nov 1981 08:52:00 GMT
Location : /catalog/display/1292/index.html
Pragma : no-cache
Server : Apache/2.0.55
Set-Cookie : PHPSESSID=9b99dd9a4afb0ef0ca267b853265b540; path=/
Vary : Accept-Encoding,User-Agent
X-Powered-By : PHP/4.4.0
我的代码返回的头信息:
Date: Fri, 05 Oct 2012 17:16:23 GMT
Server: Apache/2.0.55
X-Powered-By: PHP/4.4.0
Set-Cookie: PHPSESSID=39ccc547fc407daab21d3c83451d9a04; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding,User-Agent
Content-Type: text/html
Connection: close
这是我的代码:
from BeautifulSoup import BeautifulSoup
import urllib
import psycopg2
import psycopg2.extras
import scrape_tools
tools = scrape_tools.tool_box()
db = tools.db_connect()
cursor = db.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
cursor.execute("SELECT data FROM table WHERE variable = 'Constant' ORDER BY data")
for row in cursor:
url = 'http://www.website.com/search/' + row['data']
headers = {
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' : 'gzip, deflate',
'Accept-Language' : 'en-us,en;q=0.5',
'Connection' : 'keep-alive',
'Host' : 'www.website.com',
'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'
}
post_params = {
'query' : row['data'],
'searchtype' : 'products'
}
post_args = urllib.urlencode(post_params)
soup = tools.request(url, post_args, headers)
print tools.get_headers(url, post_args, headers)
请注意:
scrape_tools
是我自己编写的一个模块。该模块中用于获取头部信息的代码(基本上)如下:class tool_box:
def get_headers(self, url, post, headers):
file_pointer = urllib.urlopen(url, post, headers)
return file_pointer.info()
有什么原因导致这种差异吗?我的代码中是否有愚蠢的错误?如何检索缺失的标题数据?我对Python相当新,所以请原谅任何愚蠢的错误。
提前感谢。非常感谢任何建议!
还有……很抱歉代码太长了 =\