Following "How to pass a user-defined argument in a Scrapy spider", I wrote the following simple spider:
import scrapy


class Funda1Spider(scrapy.Spider):
    name = "funda1"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam'):
        self.start_urls = ["http://www.funda.nl/koop/%s/" % place]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
This seems to work; for example, if I run it from the command line like this:
scrapy crawl funda1 -a place=rotterdam
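it produces a file named "rotterdam.html" that looks similar to http://www.funda.nl/koop/rotterdam/. Next, I would like to extend this so that I can specify a subpage, for example http://www.funda.nl/koop/rotterdam/p2/. I tried the following: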
import scrapy


class Funda1Spider(scrapy.Spider):
    name = "funda1"
    allowed_domains = ["funda.nl"]

    def __init__(self, place='amsterdam', page=''):
        self.start_urls = ["http://www.funda.nl/koop/%s/p%s/" % (place, page)]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
However, if I try to run it with
scrapy crawl funda1 -a place=rotterdam page=2
I get the following error:
crawl: error: running 'scrapy crawl' with more than one spider is no longer supported
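I don't really understand this error message, since I'm not trying to crawl two spiders, only to pass two keyword arguments to modify start_urls. How can I make this work?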
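
A possible explanation, offered as a sketch rather than a definitive answer: as far as I know, Scrapy expects every spider argument to carry its own -a flag, so in the failing command page=2 is read as a second spider name. Running `scrapy crawl funda1 -a place=rotterdam -a page=2` should avoid the error. Separately, the default page='' makes the format string produce .../p/, which is probably not intended. The helper below (build_start_url is a hypothetical name, not part of Scrapy) shows one way to build the URL so that the page suffix is only added when a page is given.

```python
def build_start_url(place="amsterdam", page=""):
    """Build the listing URL for a place and an optional page number.

    Hypothetical helper for illustration; arguments passed with
    `-a page=2` arrive in the spider as strings, so `page` is a string.
    """
    url = "http://www.funda.nl/koop/%s/" % place
    if page:  # only append the /p<N>/ suffix when a page was supplied
        url += "p%s/" % page
    return url


if __name__ == "__main__":
    print(build_start_url("rotterdam", "2"))
    print(build_start_url("rotterdam"))
```

Inside the spider's __init__ this could be used as `self.start_urls = [build_start_url(place, page)]`. Note that with a page suffix, `response.url.split("/")[-2]` in parse yields "p2" rather than "rotterdam", so the output filename would change as well.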