Puppeteer: get an array of hrefs, then iterate over each href and the hrefs on that page

Question

Tags: javascript, html, node.js, arrays, puppeteer

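I am trying to scrape data using Puppeteer in node.js.

Currently, I am writing a script that scrapes all the data in a particular section of well.ca.

This is the method/logic I am trying to implement in node.js:

1 - Go to the "Medicine & Health" section of the site

2 - Get an array of subsection hrefs from .panel-body-content, using the DOM selector .panel-body-content a[href]

3 - Iterate through each link (subsection) with a for loop

4 - For each subsection link, get an array of hrefs for each product via .col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href], where the class value is col-lg-5ths col-md-3 col-sm-4 col-xs-6

5 - Loop through each product in the subsection

6 - Scrape the data for each product

So far, I have written most of the code above.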

const puppeteer = require('puppeteer');
const chromeOptions = {
  headless: false,
  defaultViewport: null,
};
(async function main() {
  const browser = await puppeteer.launch(chromeOptions);
  try {
    const page = await browser.newPage();
    await page.goto("https://well.ca/categories/medicine-health_2.html");
    console.log("::::::: OPEN WELL   ::::::::::");

    // Collect the href attribute of every subsection link.
    const hrefs1 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.getAttribute('href')
      )
    );

    console.log(hrefs1);

    const urls = hrefs1;

    for (let i = 0; i < urls.length; i++) {
      const url = urls[i];
      await page.goto(url);
    }
    const hrefs2 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]'),
        a => a.getAttribute('href')
      )
    );
  } finally {
    await browser.close();
  }
})();

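When I try to get the array of hrefs for each product, the array comes back empty.

How do I add a nested for loop to get all the hrefs for every product in every subsection, and then visit each product link?

What is the correct DOM selector to get all the hrefs inside the class .col-lg-5ths col-md-3 col-sm-4 col-xs-6 and the id product_grid_link?

If I want to add a subsequent loop that fetches the information for each product from the product hrefs of each subsection, how do I fit it into the code?

Any help is greatly appreciated.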

- Tekky

1 Answer

Content provided by Stack Overflow.

- vsemozhebuty · Accepted Answer

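It seems that some links are duplicated, so it is better to collect the links to all the final pages, dedupe the list, and only then scrape the final pages. (You can also save the final page links to a file for later use.) This script collected 5395 links (deduped).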
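Before the full script, the dedupe-and-save idea can be sketched on its own, without Puppeteer. This is only an illustration under assumptions: the helper names `dedupeLinks`, `saveLinks`, and `loadLinks`, the `example.com` URLs, and the `product-links.json` file name are all made up here and are not part of the original answer. A `Set` keeps the first occurrence of each link in insertion order, which is why it works as a dedupe step.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// Merge several arrays of links and drop duplicates, keeping first-seen order.
function dedupeLinks(...linkArrays) {
  return [...new Set(linkArrays.flat())];
}

// Write the links to a JSON file so a later run can skip the crawl.
function saveLinks(filePath, links) {
  fs.writeFileSync(filePath, JSON.stringify(links, null, 2));
}

// Read the links back as a plain array.
function loadLinks(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Made-up sample data standing in for links gathered from two subsections.
const sectionA = ['https://example.com/p/1', 'https://example.com/p/2'];
const sectionB = ['https://example.com/p/2', 'https://example.com/p/3'];

const links = dedupeLinks(sectionA, sectionB);

// Hypothetical file location; any writable path would do.
const file = path.join(os.tmpdir(), 'product-links.json');
saveLinks(file, links);
console.log(loadLinks(file));
```

In the script that follows, the same effect comes from wrapping the collected arrays in `new Set(...)`; spreading the set back into an array is what makes it serializable with `JSON.stringify`.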

'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
    const [page] = await browser.pages();

    await page.goto('https://well.ca/categories/medicine-health_2.html');

    const hrefsCategoriesDeduped = new Set(await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.href
      )
    ));

    const hrefsPages = [];

    for (const url of hrefsCategoriesDeduped) {
      await page.goto(url);
      hrefsPages.push(...await page.evaluate(
        () => Array.from(
          document.querySelectorAll('.col-lg-5ths.col-md-3.col-sm-4.col-xs-6 a[href]'),
          a => a.href
        )
      ));
    }

    const hrefsPagesDeduped = new Set(hrefsPages);

    // hrefsPagesDeduped can be converted back to an array
    // and saved in a JSON file now if needed.

    for (const url of hrefsPagesDeduped) {
      await page.goto(url);

      // Scrape the page.
    }

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
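As for why the selector in the question returned an empty array: in CSS, a space is the descendant combinator, and a name without a leading dot is a type (tag) selector. So `.col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]` looks for elements named `col-md-3`, `col-sm-4`, and `col-xs-6` nested inside one another, and matches nothing. To match one element that carries all four classes, the class names are chained with dots and no spaces, as in the script above. A minimal sketch of that conversion, assuming a hypothetical helper named `classAttrToSelector` (not part of Puppeteer or the DOM API):

```javascript
'use strict';

// Turn the value of a class attribute ("a b c") into a compound class
// selector (".a.b.c") that matches elements carrying all of those classes.
// Assumption: class names contain no characters that need CSS escaping.
function classAttrToSelector(classAttr) {
  return classAttr
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((name) => `.${name}`)
    .join('');
}

const productCell = classAttrToSelector('col-lg-5ths col-md-3 col-sm-4 col-xs-6');

// The space before a[href] is intentional: links nested inside the cell.
const productLinks = `${productCell} a[href]`;
console.log(productLinks);
// prints ".col-lg-5ths.col-md-3.col-sm-4.col-xs-6 a[href]"
```

The resulting string can be passed to `document.querySelectorAll` inside `page.evaluate`. Note also that the script above reads `a.href` rather than `a.getAttribute('href')`: the property returns the resolved absolute URL, which is what `page.goto` needs when the markup uses relative links.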