How do I use Puppeteer plugins together with Puppeteer Cluster?

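I have a list of URLs to scrape from a website that uses React, so I am using Puppeteer. To avoid being blocked by anti-bot servers, I added puppeteer-extra-plugin-stealth. I want to stop ads from loading on the pages, so I use puppeteer-extra-plugin-adblocker. I also want to keep my IP address from being blacklisted, so I use TOR nodes to get different IP addresses. Below is a simplified version of my code, and this setup works (TOR_port and webUrl are assigned dynamically in practice, but I have hard-coded them as variables to keep the question simple). There is a problem, though: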
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

var TOR_port = 13931;
var webUrl ='https://www.zillow.com/homedetails/2861-Bass-Haven-Ln-Saint-Augustine-FL-32092/47739703_zpid/';


(async () => {
    const browser = await puppeteer.launch({
        dumpio: false,
        headless: false,
        args: [
            // Route all traffic through the local TOR SOCKS5 proxy
            `--proxy-server=socks5://127.0.0.1:${TOR_port}`,
            `--no-sandbox`,
        ],
        ignoreHTTPSErrors: true,
    });

    try {
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(webUrl, {
            waitUntil: 'load',
            timeout: 30000,
        });

        // Await the selector so that a timeout is caught by the catch block below
        await page.waitForSelector('.price');
        console.log('The price is available');
    } catch (e) {
        // Navigation failed or the selector never appeared,
        // so this is clearly not a zillow website
        console.error('This is not the zillow website');
    } finally {
        // Close the browser on both the success and the failure path
        await browser.close();
    }
})();

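The setup above works, but it is very unreliable, and I recently learned about Puppeteer-Cluster. I need it to help me manage crawling multiple pages and to keep track of my crawl tasks.

So my question is how to implement Puppeteer-Cluster with the setup above. I know the library provides an example (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/different-puppeteer-library.js) that shows how to use plugins, but it is so bare-bones that I could not fully understand it.

How can I implement Puppeteer-Cluster with the TOR, AdBlocker, and Stealth configuration above?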


Maybe you can read this for reference: https://github.com/thomasdondorf/puppeteer-cluster/issues/228#issue-530751635 - Edi Imanto
2 Answers

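You can simply hand your plugin-enabled Puppeteer instance over to Cluster.launch() through its puppeteer option, like this:
const puppeteer = require('puppeteer-extra');
const _StealthPlugin = require('puppeteer-extra-plugin-stealth');
const _AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
const { Cluster } = require('puppeteer-cluster');

puppeteer.use(_StealthPlugin());
puppeteer.use(_AdblockerPlugin());

(async () => {
    // The cluster launches its browsers with this instance, so the plugins apply to every worker
    const cluster = await Cluster.launch({
        puppeteer,
    });
})();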

Source: https://github.com/thomasdondorf/puppeteer-cluster#clusterlaunchoptions

That link points to the documentation for the Puppeteer-Cluster launch options.
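
Combined with the setup from the question, a minimal sketch might look like the one below. It assumes a single local TOR SOCKS5 proxy on port 13931 (the value from the question), an arbitrarily chosen maxConcurrency of 2, and target URLs passed on the command line. The names buildClusterOptions and crawl are hypothetical helpers for this illustration and are not part of puppeteer-cluster.

```javascript
// Hypothetical helper: builds the options object for Cluster.launch().
// Kept as a pure function so it can be inspected without starting a browser.
function buildClusterOptions(puppeteer, torPort) {
    return {
        puppeteer,            // the plugin-enabled puppeteer-extra instance
        maxConcurrency: 2,    // assumption: two parallel workers
        puppeteerOptions: {   // forwarded to puppeteer.launch() by the cluster
            headless: false,
            ignoreHTTPSErrors: true,
            args: [
                `--proxy-server=socks5://127.0.0.1:${torPort}`,
                '--no-sandbox',
            ],
        },
    };
}

// Hypothetical entry point: crawls every URL and reports whether '.price' shows up.
async function crawl(urls) {
    // Required lazily so the helper above can be used without these packages installed
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
    const { Cluster } = require('puppeteer-cluster');

    puppeteer.use(StealthPlugin());
    puppeteer.use(AdblockerPlugin());

    const cluster = await Cluster.launch(buildClusterOptions(puppeteer, 13931));

    // Runs once per queued URL; a thrown error is reported via 'taskerror'
    await cluster.task(async ({ page, data: url }) => {
        await page.setViewport({ width: 1280, height: 720 });
        await page.goto(url, { waitUntil: 'load', timeout: 30000 });
        await page.waitForSelector('.price');
        console.log(`The price is available on ${url}`);
    });

    cluster.on('taskerror', (err, data) => {
        console.error(`Error crawling ${data}: ${err.message}`);
    });

    for (const url of urls) {
        cluster.queue(url);
    }

    await cluster.idle();   // wait until the queue is empty
    await cluster.close();  // shut down all browsers
}

// Only start crawling when URLs are given on the command line
const urls = process.argv.slice(2);
if (urls.length > 0) {
    crawl(urls).catch((err) => {
        console.error(err);
        process.exitCode = 1;
    });
}
```

Run it as node crawl.js <url> [<url> ...]. Because the proxy flag lives in puppeteerOptions, every worker shares the same TOR port in this sketch.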


You can add plugins with puppeteer.use().

For that you need puppeteer-extra.

const { addExtra } = require("puppeteer-extra");
const vanillaPuppeteer = require("puppeteer");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const RecaptchaPlugin = require("puppeteer-extra-plugin-recaptcha");
const { Cluster } = require("puppeteer-cluster");

(async () => {
  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(RecaptchaPlugin());

  // Do stuff
})();
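
As one way to fill in the "Do stuff" placeholder, the sketch below passes the addExtra instance to Cluster.launch() and gives each worker its own TOR port. It assumes a puppeteer-cluster version that supports the perBrowserOptions launch option, and it assumes that option must contain exactly one entry per worker. The ports 9050 and 9052 and the names buildPerBrowserOptions and crawlWithTorPorts are hypothetical.

```javascript
// Hypothetical helper: one launch-options object per worker,
// each pointing at a different local TOR SOCKS5 port.
function buildPerBrowserOptions(torPorts) {
  return torPorts.map((port) => ({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--proxy-server=socks5://127.0.0.1:${port}`, "--no-sandbox"],
  }));
}

// Hypothetical entry point: fetches the title of every URL through the cluster.
async function crawlWithTorPorts(urls, torPorts) {
  // Required lazily so the helper above can be used without these packages installed
  const { addExtra } = require("puppeteer-extra");
  const vanillaPuppeteer = require("puppeteer");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");
  const { Cluster } = require("puppeteer-cluster");

  const puppeteer = addExtra(vanillaPuppeteer);
  puppeteer.use(StealthPlugin());
  puppeteer.use(AdblockerPlugin());

  const cluster = await Cluster.launch({
    puppeteer,
    // One browser per worker, so each worker can have its own proxy
    concurrency: Cluster.CONCURRENCY_BROWSER,
    // Assumption: the number of workers must match the number of option sets
    maxConcurrency: torPorts.length,
    perBrowserOptions: buildPerBrowserOptions(torPorts),
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: "load", timeout: 30000 });
    return page.title();
  });

  // cluster.execute() resolves with the task's return value or rejects on error
  const results = await Promise.allSettled(urls.map((url) => cluster.execute(url)));
  results.forEach((result, i) => {
    const outcome = result.status === "fulfilled" ? result.value : `failed: ${result.reason.message}`;
    console.log(`${urls[i]} -> ${outcome}`);
  });

  await cluster.idle();
  await cluster.close();
}

// Only start crawling when URLs are given on the command line
const cliUrls = process.argv.slice(2);
if (cliUrls.length > 0) {
  crawlWithTorPorts(cliUrls, [9050, 9052]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Using cluster.execute() instead of cluster.queue() here is a design choice: it lets the caller collect each task's result per URL, which helps with tracking which crawl jobs succeeded.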

