TL;DR: see the self-contained minimal example script for running Scrapy at the bottom of this answer.
First of all, managing your web-scraping logic with a separate .cfg, settings.py, pipelines.py, items.py, spiders package etc. is the recommended approach. It provides modularity and separation of concerns, which keeps things organized, clear, and easy to test.
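For reference, this is roughly the layout that scrapy startproject generates (directory names below assume a hypothetical project called myproject):

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py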
If you are creating a project the way the official Scrapy tutorial suggests, you can run your web crawler via the dedicated scrapy command-line tool:
scrapy crawl myspider
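The tool accepts other handy flags as well; for example, -o exports the scraped items through a feed export without writing any pipeline code:

scrapy crawl myspider -o items.json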
But Scrapy also provides an API to run crawling from a script.
There are several key concepts worth mentioning:

Settings class - basically a key-value "container" which is initialized with default built-in values.

Crawler class - the main class which acts like a glue for all the different components involved in web scraping with Scrapy.

Twisted reactor - since Scrapy is built on top of the Twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple terms, an event loop:

The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external "event provider", which generally blocks until an event has arrived, and then calls the relevant event handler ("dispatches the event"). The reactor provides basic interfaces to a number of services, including network communications, threading and event dispatching.
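To make the "event loop" idea concrete, here is a minimal, self-contained Twisted sketch (not Scrapy-specific) that schedules a callback and then blocks in reactor.run() until reactor.stop() is called:

from twisted.internet import reactor

def say_hello():
    print("hello from inside the event loop")
    reactor.stop()  # without this, reactor.run() would block forever

# schedule the callback to fire one second after the loop starts
reactor.callLater(1, say_hello)
reactor.run()  # blocks here, dispatching events, until reactor.stop()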
Here is the basic and simplified process of running Scrapy from a script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings()
instantiate Crawler with the settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have post-processing logic, collect stats, or, at least, to ever finish crawling, since the Twisted reactor needs to be stopped manually. Scrapy docs suggest stopping the reactor in the spider_closed signal handler:
Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # process the collected stats here, if needed
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start the crawler instance with the spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example self-contained script using the DmozSpider spider and involving item loaders with input and output processors, and item pipelines (note: it targets the legacy pre-1.0 Scrapy API and Python 2, hence the unicode and scrapy.contrib imports):
import json
from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # serialize each item as a single JSON line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collected stats are available here
    reactor.stop()  # stop the reactor manually, or the script never exits


settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

crawler = Crawler(settings)
spider = DmozSpider()
crawler.signals.connect(callback, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
Run it the usual way:
python runner.py
And observe the items exported to items.jl with the pipeline in place:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
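Since JsonWriterPipeline writes one JSON object per line (the JSON Lines format, hence the .jl extension), the output can be consumed line by line; a quick sketch:

import json

with open('items.jl') as f:
    for line in f:
        item = json.loads(line)
        print(item['title'])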
The Gist is available here (improvements are welcome):
Note:
If you define settings by instantiating a Settings() object, you get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, configure a DEPTH_LIMIT, or tweak any other setting, you need to set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
'mypackage.pipelines.FilterPipeline': 100,
'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
Or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
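The two approaches combine naturally; a minimal sketch, assuming the script is run from inside a Scrapy project (next to scrapy.cfg, so that get_project_settings() can locate settings.py):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# project-wide settings can still be overridden per run:
settings.set('DEPTH_LIMIT', 2, priority='cmdline')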
Other useful links on the topic: