Raising CloseSpider from a Scrapy pipeline

Question

6

I need to raise CloseSpider from a Scrapy pipeline, or else pass some parameter from the pipeline back to the spider so that the spider can do the raise.

For example, if the date already exists, raise CloseSpider:

raise CloseSpider('Already been scraped:' + response.url)

Is there any way to do this?

- MoreScratch

Related: https://dev59.com/5Wkw5IYBdhLWcg3ws833#9699317 - alecxe

You cannot call close_spider from a pipeline. As a hack, you can set a variable on the spider instance inside the pipeline's process_item function. - user12123215

2 Answers

1

I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # close_spider expects the spider itself, not the pipeline (self)
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

- Macbric

Somehow this doesn't work. The preferred way to stop the spider seems to be to stop yielding in the parse function. - vladimir.gorea
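
For reference, here is a minimal sketch of how the engine-based approach above might be wired up. The pipeline name, the `seen_urls` set, and the `url` item key are hypothetical, chosen only for illustration. It assumes the pipeline is built through the `from_crawler` hook and that the spider, not the pipeline, is passed to `engine.close_spider`. Closing happens asynchronously, so items already in flight may still reach the pipeline.

```python
class StopOnDuplicatePipeline:
    """Hypothetical pipeline that asks the engine to close the spider on a duplicate."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.seen_urls = set()  # illustrative in-memory duplicate check

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, when defined, to build the pipeline.
        return cls(crawler)

    def process_item(self, item, spider):
        url = item["url"]  # assumes items carry a "url" field
        if url in self.seen_urls:
            # First argument is the spider, second is the close reason.
            self.crawler.engine.close_spider(spider, reason="duplicate")
        self.seen_urls.add(url)
        return item
```

Passing the pipeline instead of the spider as the first argument is one possible reason the call appears not to work.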

- user12123215 · Accepted Answer

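According to the Scrapy docs, the CloseSpider exception can only be raised from a spider callback (the parse method by default). Raising it in a pipeline will crash the spider. To get a similar result, you can send a shutdown signal to close Scrapy gracefully.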

# scrapy.project is only available in old Scrapy releases
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)

Note that after receiving the shutdown signal, Scrapy may still process requests that have already been sent or even just scheduled.

If you want to do this from the spider instead, set a variable on it from the pipeline like this.

def process_item(self, item, spider):
    if some_condition_is_met:
        # Only flag the spider here; its callback does the actual raise
        spider.close_manually = True
    return item

Then, in your spider callback, you can raise the CloseSpider exception.

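from scrapy.exceptions import CloseSpider

def parse(self, response):
    # Default to False in case the pipeline never set the flag
    if getattr(self, 'close_manually', False):
        raise CloseSpider('Already been scraped.')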
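
Putting the two halves together, the flag approach might look roughly like the sketch below. `DuplicateCheckPipeline`, `FlagCheckMixin`, `check_close`, `seen_urls`, and the `url` item key are hypothetical names introduced for illustration; only `CloseSpider` and the `process_item(self, item, spider)` signature come from Scrapy. The `ImportError` fallback is a stand-in so the sketch can run without Scrapy installed, and is not part of the technique.

```python
try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in used only when Scrapy is not installed.
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__()
            self.reason = reason


class DuplicateCheckPipeline:
    """Hypothetical pipeline that flags the spider when an item was seen before."""

    def __init__(self):
        self.seen_urls = set()  # illustrative in-memory duplicate check

    def process_item(self, item, spider):
        url = item["url"]  # assumes items carry a "url" field
        if url in self.seen_urls:
            # Only set a flag here; the spider callback raises CloseSpider.
            spider.close_manually = True
        self.seen_urls.add(url)
        return item


class FlagCheckMixin:
    """Hypothetical spider mixin; call check_close() at the top of each callback."""

    close_manually = False  # class-level default, so the attribute always exists

    def check_close(self, url):
        if self.close_manually:
            raise CloseSpider("Already been scraped: " + url)
```

A spider would inherit from both `scrapy.Spider` and the mixin and call `self.check_close(response.url)` as the first line of `parse`. Because the flag is only checked when the next callback runs, a few more responses may be processed before the spider actually closes.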