I'm writing a Scrapy spider that crawls a set of URLs once per day. However, some of these websites are very large, so I can't do a full site crawl daily, nor would I want to generate the unnecessary traffic that would involve.

An old question (here) asked something similar. However, the upvoted answer simply points to a code snippet (here), which seems to require something of the request instance, though that is never explained in the answer or on the page containing the code snippet.

I'm trying to work this out, but middlewares are a bit confusing. A complete example of a scraper that can be run multiple times without re-scraping URLs, whether or not it uses the linked middleware, would be very useful.

I've posted code below to get the ball rolling, but I don't necessarily need to use this middleware. Any Scrapy spider that can crawl daily and extract new URLs will do. Obviously, one solution would be to just keep a dict of crawled URLs and check each new URL against it, but that seems very slow/inefficient.

Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from cnn_scrapy.items import NewspaperItem


class NewspaperSpider(CrawlSpider):
    name = "newspaper"
    allowed_domains = ["cnn.com"]
    start_urls = [
        "http://www.cnn.com/"
    ]

    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Scraping: " + response.url)

        item = NewspaperItem()
        item["url"] = response.url

        yield item
Items
import scrapy


class NewspaperItem(scrapy.Item):
    url = scrapy.Field()
    visit_id = scrapy.Field()
    visit_status = scrapy.Field()
Middlewares (ignore.py)
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from cnn_scrapy.items import NewspaperItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already visited
    before. The requests to be filtered by have a meta['filter_visited'] flag
    enabled and optionally define an id to use for identifying them, which
    defaults the request fingerprint, although you'd want to use the item id,
    if you already have it beforehand to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                ret.append(NewspaperItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
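As far as I can tell, the "something of the request instance" the snippet needs is the meta['filter_visited'] flag: the middleware only filters requests that carry it. So two pieces of wiring seem to be required, sketched below. The module path, the priority value 543, and the process_request hook usage are my assumptions about how this is meant to be hooked up, not something the linked answer states:

```python
# settings.py -- register the spider middleware (module path and
# priority are assumptions; adjust to your project layout)
SPIDER_MIDDLEWARES = {
    "cnn_scrapy.ignore.IgnoreVisitedItems": 543,
}

# In the spider, each request to be filtered must set meta['filter_visited'].
# With a CrawlSpider, one place to do that is the Rule's process_request
# hook (which in old Scrapy versions receives just the request):
def tag_for_filtering(request):
    request.meta["filter_visited"] = True
    return request

rules = (
    Rule(LinkExtractor(), callback="parse_item", follow=True,
         process_request=tag_for_filtering),
)
```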