Scrapy CrawlSpider doesn't execute LinkExtractor when run via process.crawl()

Question


I can't understand why my spider crawls only the start_url and never extracts any of the URLs that match the allow parameter.

from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.settings import Settings
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com/"]
    rules = [Rule(LinkExtractor(allow='/product_page/'), callback='parse', follow=True)]
    start_urls = ["http://www.website.com/list_of_products.php"]    
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)


if __name__ == '__main__':
    settings = Settings()
    settings.set('ITEM_PIPELINES', {
        'pipelines.csv_pipeline.CsvPipeline': 100
    })
    process = CrawlerProcess(settings)
    process.crawl(MySpider)
    process.start()

I'm not sure whether this problem happens because the crawl is launched from the __name__ == '__main__' block.

- Carlos

1 Answer


- Ismael Padilla · Accepted Answer

The problem is probably that you are redefining the parse method, which should be avoided. From the crawling rules docs:

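Warning: When writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.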
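
To make the warning concrete, here is a toy model of the mechanism it describes. It is not Scrapy's actual code: MiniCrawlSpider, the SITE dictionary and both spider subclasses are hypothetical stand-ins, and the only assumption taken from the docs is that the base class applies its rules inside parse().

```python
# Toy model only. Nothing here is Scrapy's real implementation:
# MiniCrawlSpider, SITE, BrokenSpider and FixedSpider are hypothetical names.
import re

# A fake site: URL -> links found on that page.
SITE = {
    "/list_of_products.php": ["/product_page/1", "/product_page/2", "/about"],
    "/product_page/1": [],
    "/product_page/2": [],
    "/about": [],
}


class MiniCrawlSpider:
    """Simplified base class whose parse() is where the rules are applied."""

    start_url = "/list_of_products.php"
    rules = []  # list of (allow_regex, callback_name) pairs

    def parse(self, url):
        # The base class uses parse() itself to dispatch links to rule callbacks.
        for link in SITE[url]:
            for allow, callback_name in self.rules:
                if re.search(allow, link):
                    yield from getattr(self, callback_name)(link)
                    break

    def crawl(self):
        # The response for the start URL is always handed to parse().
        return list(self.parse(self.start_url))


class BrokenSpider(MiniCrawlSpider):
    rules = [(r"/product_page/", "parse")]

    def parse(self, url):  # replaces the rule-handling logic entirely
        yield {"scraped": url}


class FixedSpider(MiniCrawlSpider):
    rules = [(r"/product_page/", "parse_item")]

    def parse_item(self, url):  # base parse() stays intact
        yield {"scraped": url}


broken_result = BrokenSpider().crawl()
fixed_result = FixedSpider().crawl()
print(broken_result)  # only the start URL is processed
print(fixed_result)   # both product pages are processed
```

In this sketch BrokenSpider only ever touches the start URL, which mirrors the symptom in the question, while FixedSpider reaches both product pages because the inherited parse() still runs.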

So I would suggest giving the function a different name (I renamed it to parse_item, as in the CrawlSpider example in the docs, but you can use any name):

class MySpider(CrawlSpider):
    name = "my_spider"
    allowed_domains = ["website.com"]
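    rules = [
        # Product pages go to parse_item; 'parse' must not be used as a rule callback.
        Rule(LinkExtractor(allow='/product_page/.+'), callback='parse_item', follow=True),
        # Listing pages are only followed for more links, so no callback is needed.
        Rule(LinkExtractor(allow='/list_of_products.+'), follow=True),
    ]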
    start_urls = ["http://www.website.com/list_of_products.php"]    
    custom_settings = {
        "ROBOTSTXT_OBEY": "True",
        "COOKIES_ENABLED": "False",
        "LOG_LEVEL": 'INFO'
    }

    def parse_item(self, response):
        try:
            item = {
                # populate "item" with data
            }
            yield MyItem(**item)
        except (DropItem, Exception) as e:
            raise DropItem("WARNING: Product item dropped due to obligatory field not being present - %s" % response.url)