Is there a Node.js tool similar to Scrapy?

Question

javascript, node.js, web-scraping, scrapy, cheerio

12

I'd like to know whether there is a Scrapy-like framework for Node.js. If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?

- user2422940

5 Answers

3

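Personally, I think Python's Scrapy is very strong when it comes to crawling or indexing a whole website. For scraping data out of a page, though, CasperJS on the JavaScript side is far more practical. It also handles AJAX sites such as AngularJS pages, which Scrapy cannot render on its own. So when I only need data from one or two pages, I prefer CasperJS.

Cheerio is faster than CasperJS, but it cannot handle AJAX pages, and it does not give the code as good a structure as CasperJS does. So even where cheerio would work, I still lean towards CasperJS.

Here is a CoffeeScript example: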

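# Assumes `casper`, `params`, `queryUrl` and `queryData` are defined earlier in the script.
casper.start 'https://reports.something.com/login', ->
  # Fill in the login form; the trailing `true` submits it
  this.fill 'form',
    username: params.username
    password: params.password
  , true

casper.thenOpen queryUrl, {method:'POST', data:queryData}, ->
  this.click 'input'

casper.then ->
  # Helper: read the text of the n-th cell in the highlighted row
  get = (number) =>
    value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()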

- Stan

Scrapy can parse AJAX pages with extensions (e.g. Scrapy Playwright, Scrapy Pyppeteer, Scrapy Splash). These projects are all developed by various Scrapy maintainers. - Kalnode

2

Exactly the same? No. But powerful and simple to use? Yes: crawler. Here is a quick example:

var Crawler = require("crawler");

var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
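
The maxConnections option above caps how many pages are fetched at the same time. As a conceptual sketch only (this is not how the crawler package is implemented, and runWithLimit and fakeRequest are hypothetical names), the same idea can be expressed with plain promises, using timers in place of real HTTP requests:

```javascript
// Conceptual sketch of a concurrency cap; not the crawler package's real code.
// runWithLimit and fakeRequest are hypothetical names used for illustration.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task that has not been started yet

  // Each worker keeps pulling the next unstarted task until none are left.
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  // Start at most `limit` workers; each runs one task at a time.
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}

// Demo: timers stand in for HTTP requests, and we record peak concurrency.
let active = 0;
let peak = 0;
const fakeRequest = (url) => async () => {
  active++;
  peak = Math.max(peak, active);
  await new Promise((resolve) => setTimeout(resolve, 20));
  active--;
  return `done: ${url}`;
};

const demo = runWithLimit(['a', 'b', 'c', 'd', 'e'].map(fakeRequest), 2)
  .then((results) => {
    console.log(results.length, 'tasks finished; peak concurrency:', peak);
    return { results, peak };
  });
```

With a limit of 2, no more than two fake requests are ever in flight, which is roughly the behaviour one would expect from a connection cap.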

- Mike

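Aside from needing to learn/know Python, is there any other reason someone would pass on Scrapy and build a custom Node.js crawler from scratch? Scrapy seems to have a lot of useful features. I think I'd rather learn basic Python and use Scrapy than try to build an equivalent crawler myself. - Kalnode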

1

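You can do some scraping with Google's Puppeteer. According to the documentation:

Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:

Generate screenshots and PDFs of pages.
Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Capture a timeline trace of your site to help diagnose performance issues.
Test Chrome Extensions.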

- JP Ventura

-1

In case you still need an answer: https://www.npmjs.org/package/scrapy. I have never tested it, but I think it may help. Happy crawling.

- P.M

1

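That module is not configurable. It only returns merchant names and phone numbers. I found a possible solution: it is not as efficient as Scrapy, but with cheerio you can manipulate the page just as you would with jQuery. - user2422940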

- pguardiario · Accepted Answer

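Scrapy is a library that adds asynchronous I/O to Python. The reason we don't have something quite like it for Node is that all I/O there is already asynchronous (unless you need it not to be).

Here is what a Scrapy script might look like in Node. Note that the URLs are processed concurrently.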

const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']

// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}

// this too.
const output = item => {
  console.log(item)
}

// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}

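// and here is the main execution: start every request at once and report any failure
Promise.all(startUrls.map(url => parse(url))).catch(err => console.error(err.message))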
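
The concurrency is easy to observe without touching the network. The sketch below replaces the axios-based `get` with a hypothetical fakeGet that resolves after a fixed delay (the URLs and delays are made up for illustration). Because all three promises are created before any of them is awaited, the total time should be close to one delay rather than the sum of all three.

```javascript
// fakeGet stands in for the answer's axios-based `get`.
// The name, URLs and delays are assumptions made for this illustration.
const fakeGet = (url, ms) =>
  new Promise((resolve) => setTimeout(() => resolve({ url, ms }), ms));

const crawlAll = async () => {
  const started = Date.now();
  // All three promises exist before the first await, so the timers overlap.
  const pages = await Promise.all([
    fakeGet('http://a.example/', 100),
    fakeGet('http://b.example/', 100),
    fakeGet('http://c.example/', 100),
  ]);
  return { pages, elapsed: Date.now() - started };
};

const run = crawlAll().then((summary) => {
  console.log(`${summary.pages.length} pages in about ${summary.elapsed} ms`);
  return summary;
});
```

Awaiting each fakeGet call in a loop instead would serialize the requests, which is the behaviour the answer's `map` over `startUrls` avoids.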