Extract all hyperlinks from an external website using Node.js and request

Question


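Right now our application writes the source code of nodejs.org to the console.

We would like it to write out all the hyperlinks on nodejs.org instead. Perhaps we only need one line of code to get the links out of body.

app.js: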

var http = require('http');

http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
}).listen(1337, '127.0.0.1');
console.log('Server running at http://127.0.0.1:1337/');

var request = require("request");



request("http://nodejs.org/", function (error, response, body) {
    if (!error)
        console.log(body);
    else
        console.log(error);
});

- Michael Moeller

2 Answers

0

package.json

    {
      "name": "url_extractor",
      "version": "1.0.0",
      "description": "tool to extract all urls from website",
      "main": "index.js",
      "scripts": {
        "start": "node index.js",
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "author": "sandip shelke",
      "license": "ISC",
      "dependencies": {
        "axios": "^0.24.0",
        "cheerio": "^1.0.0-rc.10"
      }
    }

index.js

        const axios = require('axios');
        var cheerio = require('cheerio');

        var baseUrl = 'target website base url';

        (async () => {
            
            try 
            {
                let homePageLinks = await getLinksFromURL(baseUrl)
                console.log(homePageLinks);
            } catch (e) { console.log(e); }

        })();



        async function getLinksFromURL(url) {

            try {
                let links = [];
                let httpResponse = await axios.get(url);

                let $ = cheerio.load(httpResponse.data);
                let linkObjects = $('a'); // get all hyperlinks

                linkObjects.each((index, element) => {
                    links.push({
                        text: $(element).text(), // get the text
                        href: $(element).attr('href'), // get the href attribute
                    });
                });

                return links;
            } catch (e) { console.log(e) }

        }

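This code only fetches links from the home page; it can be run recursively to load the links from every page of the site.

Assuming Node is already installed, run npm install and then npm start to run the code above.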
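
As a rough sketch of the recursive idea, the function below walks a site page by page. It is not part of the original answer: `crawl`, its `fetchLinks` parameter, and the `maxPages` cap are hypothetical names, and it uses a queue rather than literal recursion so that deep sites do not grow the call stack. It assumes `fetchLinks` behaves like `getLinksFromURL` above, resolving to an array of `{ text, href }` objects (or `undefined` on failure).

```javascript
// Hypothetical helper: visit pages breadth-first, starting at startUrl,
// and follow only links that stay on the same origin.
// fetchLinks(url) is assumed to resolve to an array of { text, href }.
async function crawl(startUrl, fetchLinks, maxPages = 50) {
    const start = new URL(startUrl);
    const visited = new Set();   // pages already fetched
    const found = new Set();     // every absolute http(s) link seen
    const queue = [start.href];  // pages waiting to be fetched

    while (queue.length > 0 && visited.size < maxPages) {
        const pageUrl = queue.shift();
        if (visited.has(pageUrl)) continue;
        visited.add(pageUrl);

        // getLinksFromURL returns undefined when the request fails
        const links = (await fetchLinks(pageUrl)) || [];

        for (const link of links) {
            if (!link.href) continue; // <a> without an href attribute

            let absolute;
            try {
                absolute = new URL(link.href, pageUrl); // resolve relative hrefs
            } catch (e) {
                continue; // skip hrefs that cannot be parsed
            }
            if (absolute.protocol !== 'http:' && absolute.protocol !== 'https:') continue;

            absolute.hash = ''; // "#section" variants point at the same page
            found.add(absolute.href);

            if (absolute.origin === start.origin && !visited.has(absolute.href)) {
                queue.push(absolute.href);
            }
        }
    }
    return [...found];
}
```

Inside the async block of index.js, `await crawl(baseUrl, getLinksFromURL)` would then return the links gathered from at most 50 same-origin pages.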

- Sandip Shelke


- user568109 · Accepted Answer

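You are probably looking for jsdom, jquery, or cheerio. What you are doing is called screen scraping, that is, extracting data from a site. jsdom/jquery offer a complete set of tools, but cheerio is faster.

Here is a cheerio example: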

var request = require('request');
var cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function (err, resp, body) {
  if (err) {
    // network or DNS failure: there is no body to parse
    console.log(err);
    return;
  }
  var $ = cheerio.load(body); // parse the HTML into a jQuery-like object
  var links = $('a');         // jQuery-style selector: all hyperlinks
  links.each(function (i, link) {
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});

You can choose whichever suits you best.
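
The `href` values printed above are raw attribute values, so they may be relative paths or non-HTTP schemes such as `mailto:`. Here is a minimal sketch, assuming you want absolute URLs grouped by whether they stay on the same host. `splitLinks` is a hypothetical helper, not part of request or cheerio, and it relies only on Node's built-in `URL` class; the sample hrefs are made up for illustration.

```javascript
// Hypothetical helper: resolve raw href values against the page URL and
// split them into same-host ("internal") and other-host ("external") lists.
function splitLinks(hrefs, pageUrl) {
  var base = new URL(pageUrl);
  var internal = new Set(); // Sets drop duplicate URLs
  var external = new Set();

  hrefs.forEach(function (href) {
    if (!href) return; // <a> without an href attribute

    var resolved;
    try {
      resolved = new URL(href, base); // turns "/about" into an absolute URL
    } catch (e) {
      return; // href could not be parsed as a URL
    }
    // ignore mailto:, javascript:, tel: and similar schemes
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return;

    if (resolved.hostname === base.hostname) {
      internal.add(resolved.href);
    } else {
      external.add(resolved.href);
    }
  });

  return { internal: Array.from(internal), external: Array.from(external) };
}

// Made-up sample input, standing in for values collected via .attr('href')
var result = splitLinks(
  ['/about', 'docs/', 'https://github.com/nodejs/node',
   'mailto:someone@example.com', undefined, '/about'],
  'http://nodejs.org/'
);
console.log(result);
// internal: http://nodejs.org/about, http://nodejs.org/docs/
// external: https://github.com/nodejs/node
```

Inside the `each` callback, the hrefs could be pushed into an array and passed to `splitLinks` together with `url` once the loop finishes.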