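As you guessed, this website uses JavaScript to load more items as you scroll the page.
Using the developer tools included in my browser (Ctrl-Shift-I in Chromium), I saw in the Network tab that the JavaScript included in the page performs the following request to load more items: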
GET http://www.website-your-are-crawling.com/men/shoes/?page=2
The web server responds with a document of the following type:
<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">
<div id="qa-quick-view-btn" class="quickviewZoom itm-quickview ui-buttonQuickview l-absolute pos-t" title="Quick View" data-url ="phosphorus-Black-Moccasins-233629.html" data-sku="PH969SH70HPTINDFAS" onClick="_gaq.push(['_trackEvent', 'BadgeQV','Shown','OFFER INSIDE']);">Quick view</div>
<div class="itm-qlInsert tooltip-qlist highlightStar"
onclick="javascript:Rocket.QuickList.insert('PH969SH70HPTINDFAS', 'catalog');
return false;" >
<div class="starHrMsg">
<span class="starHrMsgArrow"> </span>
Save for later </div>
</div>
<a id='cat_105_PH969SH70HPTINDFAS' class="itm-link sobrTxt" href="/phosphorus-Black-Moccasins-233629.html"
onclick="fireGaq('_trackEvent', 'Catalog to PDP', 'men--Shoes--Moccasins', 'PH969SH70HPTINDFAS--1699.00--', this),fireGaq('_trackEvent', 'BadgePDP','Shown','OFFER INSIDE', this);">
<span class="lazyImage">
<span style="width:176px;height:255px;" class="itm-imageWrapper itm-imageWrapper-PH969SH70HPTINDFAS" id="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
<noscript><img src="http://static4.jassets.com/p/Phosphorus-Black-Moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
</span>
</span>
<span class="itm-budgeFlag offInside"><span class="flagBrdLeft"></span>OFFER INSIDE</span>
<span class="itm-Catbrand strong">Phosphorus</span>
<span class="itm-title">
Black Moccasins </span>
These documents contain more items.
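As a rough illustration of what can be pulled out of such a fragment, here is a minimal sketch using only Python's standard library. It assumes that product links are the `<a>` tags carrying the `itm-link` class, as in the response shown above. The names `ItemLinkParser` and `extract_item_links` are made up for this example, and the `sample` string is a shortened copy of that response.

```python
from html.parser import HTMLParser


class ItemLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose class list contains 'itm-link'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None for bare attributes, hence the "or".
        classes = (attr_map.get("class") or "").split()
        if "itm-link" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_item_links(fragment):
    """Return the product links found in one AJAX response fragment."""
    parser = ItemLinkParser()
    parser.feed(fragment)
    return parser.links


# Shortened copy of the fragment shown above.
sample = (
    '<li id="PH969SH70HPTINDFAS" class="itm hasOverlay unit size1of4 ">'
    '<a id="cat_105_PH969SH70HPTINDFAS" class="itm-link sobrTxt" '
    'href="/phosphorus-Black-Moccasins-233629.html">Black Moccasins</a></li>'
)
print(extract_item_links(sample))
```

Inside a real spider you would more likely use the selectors that scrapy provides; the standard-library parser is used here only to keep the sketch free of dependencies.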
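So, to get the full list of items, you need to return Request objects from your Spider's parse method (see the Spider class documentation) to tell scrapy that it should load more data: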
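    from scrapy.http import Request

    def parse(self, response):
        # n is the number of the next "page" to parse. Carrying it in
        # response.meta is an assumption; track it however suits your spider.
        n = response.meta.get("page", 1) + 1
        req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                      headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                               "X-Requested-With": "XMLHttpRequest"},
                      meta={"page": n})
        return req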
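One possible way to work out the next page number is to derive it from the URL of the response you just received. The sketch below is only an illustration: `next_page_url` is a hypothetical helper, and treating a URL without a `page` parameter as page 1 is an assumption about how the site numbers its pages.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url):
    """Return url with its 'page' query parameter incremented by one.

    A URL without a 'page' parameter is treated as page 1 (an assumption).
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    current = int(query.get("page", ["1"])[0])
    query["page"] = [str(current + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("http://www.website-your-are-crawling.com/men/shoes/"))
print(next_page_url("http://www.website-your-are-crawling.com/men/shoes/?page=2"))
```

In a spider you could pass `response.url` to this helper and use the result as the `url` of the next Request. You would probably also want to stop once a response contains no more items.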
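By the way (in case you want to test), you cannot simply load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website redirects you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is different from XMLHttpRequest.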
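If you want to test the request outside scrapy, you can set the header yourself. The following is a minimal sketch using Python's standard `urllib.request`; `BASE_URL` and `build_ajax_request` are names chosen for this example. The request is only built, not sent, so the snippet runs without network access.

```python
import urllib.request

BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page):
    """Build (but do not send) a GET request that mimics the page's XHR call."""
    return urllib.request.Request(
        BASE_URL + "?page=" + str(page),
        headers={
            "Referer": BASE_URL,
            "X-Requested-With": "XMLHttpRequest",
        },
    )


req = build_ajax_request(2)
print(req.full_url)
# To actually fetch the fragment (network access required):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read()[:500])
```

Based on the behaviour described above, dropping the `X-Requested-With` header from this request should get you redirected to the global page instead of receiving the item fragment.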
[...] whether response.header['X-Requested-With'] equals "XMLHttpRequest", so the website may redirect you to (or serve) the original items page. Also, you should use yield req or put all your requests in a list. - Xion345
ERROR: Spider must return Request, BaseItem or None, got 'Request' in <GET http://www.jabong.com/men/shoes/>. - Vaibhav Jain