Passing arguments to a Scrapy Spider from a Python script


Before posting this question, I referred to several related questions (I don't currently have links to all of them):

This code runs perfectly well if I don't pass arguments and instead ask for user input from within the BBSpider class (below the name="dmoz" line, without a main function), or if I provide them as predefined (i.e. static) arguments.

My code is here

I basically want to execute a Scrapy spider from a Python script without needing any other files (not even a settings file). That is why I specified the settings inside the code itself.
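For reference, the general shape of such a self-contained script (settings built in code rather than loaded from a project) is sketched below; the names are illustrative, not the exact code behind the link above:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

# build the settings in code instead of relying on a settings.py / scrapy.cfg
settings = Settings()
settings.set('USER_AGENT', 'Mozilla/5.0')  # any setting can be set like this

process = CrawlerProcess(settings)
process.crawl(BBSpider)   # the spider class defined in the same script
process.start()           # blocks until the crawl finishes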

This is the output I get when executing the script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems I am currently facing are as follows:
  • If you look carefully at my output (Line 1 and Line 6), you will see that the start_url I passed to the spider is printed twice, even though I wrote the print statement only once, on Line 31 of my code (linked above). Why does this happen, and with different values at that (the first print on Line 1 of my output gives the correct result, while the print on Line 6 is wrong)? Not only that, even if I write `print 'hi'`, it gets printed twice. Why is this?
  • Next, if you look at this line in my output:
    TypeError: Request url must be str or unicode, got NoneType:
    Why does this happen (even though the question I linked to describes the very same thing)? I don't know how to fix it. I even tried `self.start_urls=[str(kwargs.get('start_url'))]` - which then produces the following output:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

Please help me resolve both of the errors above.


Have you checked this answer? How to run Scrapy from within a Python script - eLRuLL
@eLRuLL: Yes, I already checked it. First, it doesn't mention what changes are needed inside the spider class (which is the core of my problem - both of the issues I listed above are in that part of the code). Second, what it describes is exactly what I have already done (if you look at my code) when I invoke the spider crawl. Please tell me how to fix this! Thanks! - Ashutosh Saboo
1 Answer

You need to pass the arguments in the crawl method of CrawlerProcess, so you need to run it like this:

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)
crawler.start()
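
For this to work, the spider itself has to pick the keyword argument up, typically in its __init__. A minimal sketch of that side (the exact BBSpider code linked in the question may differ):

import scrapy

class BBSpider(scrapy.Spider):
    name = "dmoz"

    def __init__(self, *args, **kwargs):
        super(BBSpider, self).__init__(*args, **kwargs)
        # start_url arrives here as the keyword argument passed to crawler.crawl(...)
        self.start_urls = [kwargs.get('start_url')]

    def parse(self, response):
        # parsing logic goes here
        pass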

Thanks, it works perfectly. Just one clarification and one doubt. Why does problem 1 happen (i.e. why is it printed twice)? And the doubt: if I want to run 2 spiders in parallel with the multiprocessing library, can I pass a queue in the same way, use queue.put(items) inside the spiders, and then access the spiders' output from the main function of the script with queue.get()? Is that possible? Could you give me sample code for that? I would be very grateful. - Ashutosh Saboo
The duplicate print appears because you instantiated a Spider object before calling the crawler - that is the first print - and then you passed that Spider instance to the crawler, but it didn't receive any arguments, hence the second print. As for the second question, I think it should be possible, but I don't have an example right now, sorry. - eLRuLL
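(In other words, the pattern being described is roughly the following - a sketch of the problematic call versus the fixed one, not the actual linked code:)

spider = BBSpider(start_url=url)   # first print: __init__ runs here with the real URL
crawler.crawl(spider)              # Scrapy builds its own spider instance without the kwargs -> second print shows None

# passing the class together with the kwargs avoids the extra, argument-less instance:
crawler.crawl(BBSpider, start_url=url)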
Thanks a lot for the reply, that clears up my doubt. For the second question, could you help by providing code that uses the Python multiprocessing library to run 2 spiders of the BBSpider class for two different start_urls? I tried it myself, but it threw some strange errors. I would be very grateful if you could share the relevant code. Thanks! - Ashutosh Saboo
Also, I tried checking a few similar questions related to the doubt I raised above, but none of them seemed to work. That's why I'm asking if you could help - I'd really appreciate it. I only started learning Scrapy a few days ago, hence the doubt. Please help if you can. Thanks! - Ashutosh Saboo
Maybe you could post it as a separate question, so that I (and more people) could help you with it. - eLRuLL
