I want to use Puppeteer to get the full content of a page. This works well for normal pages, but if the page performs a window.location redirect, I want to block the redirect and get only the original content. For example, if https://example.com/thisredirects returns:
<html>
<body>
<p>Page not found - Please wait while we redirect you home...</p>
<script type="text/javascript" language="javascript">
window.location = "//example.com";
</script>
</body>
</html>
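I want to get that HTML and prevent the location redirect. If I try to use setRequestInterception to block/abort the location change, the response comes back null, and the redirect is not actually fully blocked (it works for redirect status codes, but not for pages that return 200 and then redirect using window.location):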
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const pageUrl = "https://example.com/thisredirects";
  const page = await browser.newPage();
  await page.setCacheEnabled(false);
  await page.setRequestInterception(true);

  const requests = [];
  page.on('request', async request => {
    let isNavRequest = request.isNavigationRequest() && request.frame() === page.mainFrame();
    if (!isNavRequest) {
      request.continue();
      return;
    }

    requests.push(request);
    if (requests.length == 1) {
      console.log("Load initial page: " + request.url());
      request.continue();
      return;
    }

    console.log("Block redirect to: " + request.url());
    request.abort();
  });

  let response;
  try {
    console.log(`Request: ${pageUrl}`);
    response = await page.goto(pageUrl, { waitUntil: 'domcontentloaded' });
    const content = await response.text();
    console.log(content);
    await page.close();
    await browser.close();
  }
  catch (err) {
    console.log(err);
  }
})()
Is there a way to block window.location and get the original HTML (as shown above) without completely disabling JavaScript?
Even if I listen to all responses:
page.on('response', async response => {
  // ok() is a method on the response; the bare property `response.ok` is always truthy
  if (response.ok() && response.url() === pageUrl) {
    console.log(await response.text());
  }
});
it still cannot get the original HTML. It throws:
Could not load body for this request. This might happen if the request is a preflight request.
page.setJavaScriptEnabled(false)
- GrafiCode
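
For reference, the suggestion in the comment above could be sketched roughly as follows. This is only a minimal sketch, not a confirmed fix for the question: `fetchRawHtml` and `main` are hypothetical names introduced here, and `page` is assumed to be a Puppeteer Page. It disables JavaScript on that page before navigating, so the inline window.location script should never run, and then reads the body of the navigation response. Note that this does turn JavaScript off for the page, which the question was hoping to avoid; using a separate page just for the raw fetch might limit the impact.

```javascript
// Hypothetical helper (name is an assumption): read the served HTML with
// scripts disabled so an inline window.location redirect cannot fire.
async function fetchRawHtml(page, url) {
  // Has to be set before navigating; it takes full effect on the next navigation.
  await page.setJavaScriptEnabled(false);

  const response = await page.goto(url, { waitUntil: 'domcontentloaded' });
  if (response === null) {
    // page.goto() can resolve with null, so fail loudly instead of crashing on .text()
    throw new Error(`No response received for ${url}`);
  }

  // Body of the navigation response, i.e. the HTML as sent by the server.
  return response.text();
}

// Example wiring; assumes the `puppeteer` package is installed.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const html = await fetchRawHtml(page, 'https://example.com/thisredirects');
    console.log(html);
  } finally {
    // Close the browser even if the navigation throws.
    await browser.close();
  }
}

// Uncomment to run against a real browser:
// main().catch(console.error);
```

Keeping the page as a parameter means the helper can be exercised with a stub object, and the same browser can still open other pages with JavaScript enabled.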