Scrapy is not crawling all links

Question

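I want to extract data from http://community.sellfree.co.kr/. Scrapy is working, but it seems to scrape only the start_urls and does not follow any links.

I would like the spider to crawl the entire site.

Here is my code: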

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from metacritic.items import MetacriticItem
class MetacriticSpider(BaseSpider):
    name = "metacritic" # Name of the spider, to be used when crawling
    allowed_domains = ["sellfree.co.kr"] # Where the spider is allowed to go
    start_urls = [
        "http://community.sellfree.co.kr/"
    ]
    rules = (Rule (SgmlLinkExtractor(allow=('.*',))
          ,callback="parse", follow= True),
        )

    def parse(self, response):
        hxs = HtmlXPathSelector(response) # The XPath selector
        sites = hxs.select('/html/body')
        items = []
        for site in sites:
            item = MetacriticItem()
            item['title'] = site.select('//a[@title]').extract()
            items.append(item)
        return items

There are two kinds of links on the page. One is onclick="location='../bbs/board.php?bo_table=maket_5_3', and the other is <a href="../bbs/board.php?bo_table=maket_5_1&sca=프로그램/솔루션"><span class="list2">solution</span></a>.

How can I make the crawler follow both kinds of links?

- user3138338

You should inherit from CrawlSpider: try class MetacriticSpider(CrawlSpider): - paul trmbrth

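Try the rule: rules = [Rule(SgmlLinkExtractor(allow=("http://www.sellfree.co.kr/.*(\.html)$")), callback='parse_item', follow=True),] or allow="http://www.sellfree.co.kr/"... Try using a regular expression for the allowed links. - Vipul

sellfree.co.kr has two kinds of links. - user3138338

Vipul Sharma, thank you for your reply, but your solution did not work. Sorry. - user3138338

paul, it makes no difference whether I use CrawlSpider or not. - user3138338

To use rules you need to use CrawlSpider, and you must not override the parse method. Try a parse_item method as the other users suggested. - R. Max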

1 Answer

- Rejected · Accepted Answer

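Before starting, I strongly recommend using a newer version of Scrapy. It looks like you are still on an old release, since many of the methods and classes you use have been moved or deprecated.

The next problem to address: the scrapy.spiders.BaseSpider class does nothing with the rules you specify. Use the scrapy.contrib.spiders.CrawlSpider class instead, which has rule handling built in.

Next, you need to rename your parse() method, because CrawlSpider uses parse() internally to do its work. (The rest of this answer assumes the name parse_page().)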

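To pick up all the basic links and crawl them, your link extractor needs to change. By default, you should not use regular expression syntax for the domains you want to follow. The following will pick them up, and links that are not on the site will be filtered out (by the offsite middleware, based on allowed_domains; the dupefilter only drops requests that have already been seen):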

rules = (
    Rule(SgmlLinkExtractor(allow=('')), callback="parse_page", follow=True),
)
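
To see why an empty allow pattern picks up everything while the regex suggested in the comments picks up nothing on this site, here is a small standard-library sketch. It assumes that allow patterns are applied to each URL with re.search, which is my understanding of how Scrapy's link extractors behave. The helper name allowed() is made up for this illustration.

```python
import re


def allowed(url, allow_patterns):
    """Return True if any pattern is found somewhere in the URL.

    Assumption: this mirrors the allow check in Scrapy's link extractors.
    """
    return any(re.search(pattern, url) for pattern in allow_patterns)


url = "http://community.sellfree.co.kr/bbs/board.php?bo_table=maket_5_3"

# The empty pattern matches at position 0 of any string.
print(allowed(url, [""]))  # True

# The pattern from the comments requires the "www" host and a ".html" suffix.
print(allowed(url, [r"http://www.sellfree.co.kr/.*(\.html)$"]))  # False
```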

As for the onclick=... links, these are JavaScript links, and the page you are trying to process relies on them heavily. Scrapy cannot crawl things like onclick=location.href="javascript:showLayer_tap('2')" or onclick="win_open('./bbs/profile.php?mb_id=wlsdydahs', because it cannot execute showLayer_tap() or win_open() in JavaScript.

You can write your own function to parse these. For example, the following handles onclick=location.href="./photo/":

import re

def process_onclick(value):
    # Pull the URL out of an onclick value such as location.href="./photo/"
    m = re.search(r'location\.href="(.*?)"', value)
    if m:
        return m.group(1)
Then add the following rule (this only handles tables; extend as needed):

Rule(SgmlLinkExtractor(allow=(''), tags=('table',), 
                       attrs=('onclick',), process_value=process_onclick), 
     callback="parse_page", follow=True),
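
The question also mentions the form onclick="location='...'", which a pattern written for location.href="..." would not match. A possible extension, offered only as a sketch: the function name extract_onclick_url and the decision to skip javascript: targets are my own assumptions. It accepts either quote style and returns None for handlers that only call JavaScript functions, so it could be passed as process_value in place of process_onclick.

```python
import re
from urllib.parse import urljoin

# location='...' or location.href="...", with a matching closing quote.
ONCLICK_URL = re.compile(r"""\blocation(?:\.href)?\s*=\s*(["'])(.*?)\1""")


def extract_onclick_url(value):
    """Return the static URL assigned in an onclick value, or None."""
    m = ONCLICK_URL.search(value)
    if not m:
        return None  # e.g. win_open(...): no location assignment to follow
    url = m.group(2)
    if url.lower().startswith("javascript:"):
        return None  # would need a JavaScript engine to resolve
    return url


print(extract_onclick_url("location='../bbs/board.php?bo_table=maket_5_3'"))
# ../bbs/board.php?bo_table=maket_5_3
print(extract_onclick_url('location.href="./photo/"'))
# ./photo/
print(extract_onclick_url("win_open('./bbs/profile.php?mb_id=wlsdydahs')"))
# None

# Relative results are resolved against the page URL before being requested.
print(urljoin("http://community.sellfree.co.kr/", "./photo/"))
# http://community.sellfree.co.kr/photo/
```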