Scrapy spider - how to specify which links to crawl

Question

3

I am going to crawl my website, http://www.cseblog.com, with Scrapy.

Here is my spider code:

from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup ## This is BeautifulSoup4
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from blogscraper.items import BlogArticle ## This is for saving data. Probably insignificant.

class BlogArticleSpider(BaseSpider):
    name = "blogscraper"
    allowed_domains = ["cseblog.com"]
    start_urls = [
        "http://www.cseblog.com/",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+/\d+/*"', ), deny=( ))),
    )

    def parse(self, response):
        site = BeautifulSoup(response.body_as_unicode())
        items = []
        item = BlogArticle()
        item['title'] = site.find("h3" , {"class": "post-title" } ).text.strip()
        item['link'] = site.find("h3" , {"class": "post-title" } ).a.attrs['href']
        item['text'] = site.find("div" , {"class": "post-body" } )
        items.append(item)
        return items

Where should I specify that links of the following types need to be crawled recursively:

http://www.cseblog.com/{d+}/{d+}/{*}.html
http://www.cseblog.com/search/{*}

but that data should be saved only from links of this type:

http://www.cseblog.com/{d+}/{d+}/{*}.html

- Pratik Poddar

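I may be jumping the gun here, but do Rule and SgmlLinkExtractor come from Scrapy or from BeautifulSoup? Without the import statements that is not clear unless you know these modules thoroughly. - Torxed

Fixed, sir. I have added the import statements. Please advise now. Thanks. - Pratik Poddar

Why use a crawler on your own website? If the goal is to get the data into a database, can't you just run queries against your own database directly? - halfer

@halfer, probably for testing purposes. - Medeiros

@halfer This is a test example. - Pratik Poddar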

1 Answer

- Biswanath · Accepted Answer

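You need to create two rules (or a single combined rule) that tell Scrapy to allow these kinds of URLs. Basically, you want the rules list to look something like this.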

rules = (
    # Article pages: save the data and keep following links.
    Rule(SgmlLinkExtractor(allow=(r'/\d+/\d+/.*\.html', )), callback='parse_save', follow=True),
    # Search pages: follow links only; nothing is saved in this callback.
    Rule(SgmlLinkExtractor(allow=(r'/search/.*', )), callback='parse_only', follow=True),
)
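
The {d+} and {*} placeholders in the question are shorthand, not regular-expression syntax, so they need to be written as real patterns before a link extractor can use them. Here is a minimal sketch, assuming the allow patterns are applied as regex searches against each URL and that the first matching rule wins. The sample URLs and the classify helper are hypothetical, for illustration only.

```python
import re

# Regex equivalents of the two URL shapes in the question.
ARTICLE = re.compile(r"/\d+/\d+/.*\.html")
SEARCH = re.compile(r"/search/.*")


def classify(url):
    """Return the name of the callback a URL would be routed to, or None."""
    if ARTICLE.search(url):
        return "parse_save"
    if SEARCH.search(url):
        return "parse_only"
    return None


# Hypothetical URLs, used only to exercise the patterns.
samples = [
    "http://www.cseblog.com/2012/05/example-post.html",
    "http://www.cseblog.com/search/label/Puzzles",
    "http://www.cseblog.com/p/about.html",
]
for url in samples:
    print(url, "->", classify(url))
```

Checking the patterns in isolation like this makes it easier to spot a stray character in a regex before running a full crawl.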

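By the way, you should subclass CrawlSpider rather than BaseSpider, and rename your parse method unless you really intend to override that method from the base class. CrawlSpider uses parse itself to apply the rules, so overriding it stops the rules from working.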

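Because the two link types have different callbacks, you can decide in each callback which page data to save, rather than having a single callback and checking response.url again.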