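I need to crawl a website and then crawl every URL found at a specific XPath on it. For example, I need to crawl "http://someurl.com/world/", which has 10 links inside a container (xpath("//div[@class='pane-content']")). I need to crawl those 10 links and extract the images from them, but the links on "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".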
Here is what I have so far:
import scrapy
# scrapy.contrib.* is the old, deprecated location of these modules.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from imgur.items import ImgurItem


class ImgurSpider(CrawlSpider):
    name = 'imgur'
    # Domain names only: no scheme, path, or trailing slash.
    allowed_domains = ['someurl.com']
    start_urls = ['http://someurl.com/news']
    # Follow every link whose URL contains /node/ and parse it for images.
    rules = [Rule(LinkExtractor(allow=r'/node/.*'), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(
            "//h1[@class='pane-content']/a/text()").extract()
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image