How to make Scrapy crawl only 1 page (make it non-recursive)?

Question

5

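I am using the latest version of Scrapy (http://doc.scrapy.org/en/latest/index.html) and am trying to figure out how to make Scrapy crawl only the URL(s) supplied in the start_urls list. In most cases I want to crawl just one page, but in some cases there may be multiple pages, which I will specify. I don't want it to crawl any other pages.

I tried setting the depth level to 1, but I'm not sure from my testing that it accomplished what I was hoping for.

Any help would be greatly appreciated!

Thank you!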

2015-12-22 - Code update:

# -*- coding: utf-8 -*-
import scrapy
from generic.items import GenericItem

class GenericspiderSpider(scrapy.Spider):
    name = "genericspider"

    def __init__(self, domain, start_url, entity_id):
        self.allowed_domains = [domain]
        self.start_urls = [start_url]
        self.entity_id = entity_id


    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath("//body//a"):
            item = GenericItem()

            item['entity_id'] = self.entity_id
            # gets the actual email address
            item['emails'] = response.xpath("//a[starts-with(@href, 'mailto')]").re(r'mailto:\s*(.*?)"')


            yield item

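In your first reply, you mentioned using a generic spider. Isn't that what I am already doing in my code? Also, are you suggesting that I remove

callback=self.parse_dir_contents

from the parse function?

Thanks.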

- Manish

3 Answers

0

I had the same problem because I was using

import scrapy
from scrapy.spiders import CrawlSpider

Then I changed it to

import scrapy
from scrapy.spiders import Spider

and changed the class to

class mySpider(Spider):

- KongPHS

0

Here is the code for a spider that scrapes the title from a blog post (note: the XPath may differ from blog to blog).

File name: /spiders/my_spider.py

import scrapy
# DmozItem must also be imported from your project's items module;
# the original answer does not show that import.

class MySpider(scrapy.Spider):
    name = "craig"
    allowed_domains = ["www.blogtrepreneur.com"]
    start_urls = ["http://www.blogtrepreneur.com/the-best-juice-cleanse-for-weight-loss/"]

    def parse(self, response):
        # Only items are returned here, never new requests,
        # so nothing beyond start_urls is fetched.
        items = []
        item = DmozItem()
        item["title"] = response.xpath('//h1/text()').extract()
        item["article"] = response.xpath('//div[@id="tve_editor"]//p//text()').extract()
        items.append(item)
        return items

The code above fetches only the title and body text of the given article.

- Shikhar Gupta

- eLRuLL · Accepted Answer

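It looks like you are using CrawlSpider, which is a special kind of Spider designed to crawl multiple categories across multiple pages.

If you only want to crawl the URLs specified in start_urls, override the parse method, since that is the default callback.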