Scrapy: crawling extracted links

Question

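I need to crawl a website and follow every URL found at a specific XPath on it. For example, I need to crawl "http://someurl.com/world/", which has 10 links inside a container (xpath("//div[@class='pane-content']")). I need to crawl those 10 links and extract the images from them, but the links on "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".

What I have so far: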

import scrapy
# scrapy.contrib was removed in later Scrapy releases; these are the current import paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    # allowed_domains takes bare domain names, without a trailing slash
    allowed_domains = ['someurl.com']
    start_urls = ['http://someurl.com/news']
    # allow is a tuple of regular expressions, hence the trailing comma
    rules = [Rule(LinkExtractor(allow=('/node/.*',)), callback='parse_imgur', follow=True)]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath(
            "//h1[@class='pane-content']/a/text()").extract()
        # every <img> src on the page; the image pipeline reads this field
        image['image_urls'] = response.xpath("//img/@src").extract()
        return image

- Nikola Niko

1 Answer

- Arijit C · Accepted Answer

You can rewrite your rules to cover all of your requirements, as follows:

rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]
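
As a rough illustration of the allow part of that rule: Scrapy treats each allow entry as a regular expression and searches for it in the absolute URL of every link it finds, while restrict_xpaths limits where in the page links are looked for. The standard-library sketch below mimics only the allow filtering. The candidate URLs are made up for the example and are not taken from the real site.

```python
import re

# Same pattern as in the rule above; each `allow` entry is a regular expression.
allow_pattern = re.compile(r'/node/.*')

# Hypothetical URLs, for illustration only.
candidate_urls = [
    'http://someurl.com/node/1234',
    'http://someurl.com/world/',
    'http://someurl.com/node/5678?page=2',
]

# A URL is kept when the pattern is found anywhere in it (search, not full match).
followed = [url for url in candidate_urls if allow_pattern.search(url)]
print(followed)
```

Because the pattern is searched for rather than matched against the whole URL, the trailing .* does not change which links are accepted here; '/node/' alone would select the same URLs.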

To download the images from the extracted image links, you can use the ImagesPipeline bundled with Scrapy.
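
A minimal sketch of what enabling that pipeline might look like, assuming a reasonably recent Scrapy release with Pillow installed. The two settings belong in the project's settings.py, and the storage path is a placeholder. The pipeline reads the item's image_urls field and records its results in an images field, so ImgurItem would need both. The helper absolute_image_urls is a hypothetical name introduced here. It is included because src attributes are often relative, while the pipeline needs absolute URLs.

```python
from urllib.parse import urljoin

# settings.py: enable the bundled images pipeline and choose where files are stored.
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'  # placeholder; any writable directory


def absolute_image_urls(page_url, srcs):
    """Resolve each (possibly relative) img src against the URL of its page."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical page URL and src values, for illustration only.
print(absolute_image_urls(
    'http://someurl.com/node/1234',
    ['/files/a.jpg', 'http://cdn.example.com/b.png'],
))
```

Inside the spider callback, response.urljoin(src) performs the same resolution against the URL of the current response.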