Navigating/scraping hashbang links with JavaScript (PhantomJS)

Question



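I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. I therefore need to simulate a browser visit, and I have been experimenting with PhantomJS. The problem is that the site uses hashbang URLs, and I can't get PhantomJS to process the hashbang: it just keeps loading the home page.

The site is http://www.regulations.gov. By default it takes you to #!home. I've tried the following code (from here) to process different hashbangs.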

if (phantom.state.length === 0) {
     if (phantom.args.length === 0) {
        console.log('Usage: loadreg_1.js <some hash>');
        phantom.exit();
     }
     var address = 'http://www.regulations.gov/';
     console.log(address);
     phantom.state = Date.now().toString();
     phantom.open(address);

} else {
     var hash = phantom.args[0];
     document.location = hash;
     console.log(document.location.hash);
     var elapsed = Date.now() - new Date().setTime(phantom.state);
     if (phantom.loadStatus === 'success') {
             if (!first_time) {
                     var first_time = true;
                     if (!document.addEventListener) {
                             console.log('Not SUPPORTED!');
                     }
                     phantom.render('result.png');
                     var markup = document.documentElement.innerHTML;
                     console.log(markup);
                     phantom.exit();
             }
     } else {
             console.log('FAIL to load the address');
             phantom.exit();
     }
}

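This code produces the correct hashbang (for example, I can set the hash to '#!contactus'), but it doesn't dynamically generate any different HTML: I only get the default page. It does, however, print the hash correctly when I call document.location.hash.

I've also tried setting the initial address to the hashbang URL, but then the script just hangs and does nothing. For example, if I set the URL to http://www.regulations.gov/#!searchResults;rpp=10;po=0, the script hangs after printing the address to the terminal, and nothing else happens.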

- tchaymore

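What does this have to do with Python? - Petr Viktorin

OK, I don't know why I put that tag on there. - tchaymore

I tried it on Windows, and I think it worked. - mattn

@mattn -- Could you give more detail about what you did and whether it succeeded? - tchaymore

I can see that result.png exists, and the HTML was printed. I didn't run into any problems. - mattn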

1 Answer


- nrabinowitz · Accepted Answer

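The problem is that the page content loads asynchronously, but you expect it to be available as soon as the page has loaded.

To scrape a page that loads content asynchronously, you need to wait until the content you're interested in has been loaded before you scrape. Depending on the page, there may be different ways of checking, but the easiest is to poll at regular intervals for something you expect to see, until you find it.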
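
The polling idea described above can be sketched as a small generic helper. This is a minimal sketch and not part of the original answer: the name waitFor, its parameters, and the default interval and timeout values are assumptions chosen for illustration. It relies only on setInterval and clearInterval.

```javascript
// Hypothetical helper: poll `check` until it returns true or time runs out.
// The name, parameters, and default values are illustrative assumptions.
function waitFor(check, onReady, onTimeout, intervalMs, timeoutMs) {
    var interval = intervalMs || 300;   // how often to re-check, in ms
    var limit = timeoutMs || 10000;     // give up after this many ms
    var started = Date.now();

    var timerId = setInterval(function () {
        if (check()) {
            // the content we were waiting for is present
            clearInterval(timerId);
            onReady();
        } else if (Date.now() - started >= limit) {
            // content never showed up; stop polling instead of hanging
            clearInterval(timerId);
            if (onTimeout) {
                onTimeout();
            }
        }
    }, interval);

    return timerId;
}
```

In a PhantomJS script, check might look for an expected element, and onReady might render the page and exit. The timeout branch keeps the script from hanging forever if the content never appears.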

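The key is figuring out what to look for: you need something that won't appear on the page until your desired content has been loaded. In this case, the easiest option I found for the top-level pages was to manually enter the H1 tag you expect to see on each page and key it to the hash.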

var titleMap = {
    '#!contactUs': 'Contact Us',
    '#!aboutUs': 'About Us'
    // etc for the other pages
};

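Then, in the success block, you can set a recurring interval that looks for the title you want in an h1 tag. When it shows up, you know you can render the page: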

if (phantom.loadStatus === 'success') {
    // poll every 300 milliseconds
    var timeoutId = window.setInterval(function () {
        // check for title element you expect to see
        var h1s = document.querySelectorAll('h1');
        var found = false;
        // querySelectorAll always returns a node list (possibly empty),
        // so test its length rather than its truthiness
        if (h1s.length > 0) {
            // h1s is a node list, not an array, hence the
            // weird syntax here
            Array.prototype.forEach.call(h1s, function(h1) {
                if (!found && h1.textContent.trim() === titleMap[hash]) {
                    // we found it!
                    found = true;
                    console.log('Found H1: ' + h1.textContent.trim());
                    phantom.render('result.png');
                    console.log("Rendered image.");
                    // stop the cycle
                    window.clearInterval(timeoutId);
                    phantom.exit();
                }
            });
            if (!found) {
                console.log('Found H1 tags, but not ' + titleMap[hash]);
            }
        } else {
            console.log('No H1 tags found.');
        }
    }, 300);
}

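The code above worked for me. But if you need to scrape search results, you'll have to find an identifying element or snippet of text that you can look for without knowing the title ahead of time.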
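
One way to approach this, sketched under assumptions: wait until at least one element matching a CSS selector exists, or until a given text fragment shows up in the page. The function names here are hypothetical, and so is any selector or fragment passed in (for example '.result-item'). The real markup of the search results page would have to be inspected to pick values that work.

```javascript
// Hypothetical helpers; the function names and any selector or text
// passed in are illustrative assumptions, not taken from the real site.

// Returns a check that passes once `selector` matches at least
// `minCount` elements (default 1) in `doc`.
function makeSelectorCheck(doc, selector, minCount) {
    var needed = minCount || 1;
    return function () {
        return doc.querySelectorAll(selector).length >= needed;
    };
}

// Returns a check that passes once the page text contains `fragment`.
function makeTextCheck(doc, fragment) {
    return function () {
        var body = doc.body;
        var text = body ? (body.textContent || '') : '';
        return text.indexOf(fragment) !== -1;
    };
}
```

A check such as makeSelectorCheck(document, '.result-item', 1) could then be called inside the polling interval in place of the H1 comparison.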

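Edit: Also, it looks like newer versions of PhantomJS now fire an onResourceReceived event when new data comes in. I haven't looked into this, but you might be able to bind a listener to it to achieve the same effect.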
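
A rough sketch of that listener idea, untested against the site in question: count the requests that have started but not yet finished, and treat the page as settled once none are pending. It assumes the PhantomJS webpage module, where page.onResourceRequested receives request data with an id, and page.onResourceReceived receives a response with the same id and a stage of 'start' or 'end'. The tracker and its method names are hypothetical.

```javascript
// Hypothetical tracker for in-flight requests; all names are illustrative.
function createResourceTracker() {
    var pending = {};   // request id -> true while the request is in flight
    var count = 0;      // number of ids currently in `pending`

    return {
        // intended to be called from page.onResourceRequested
        requested: function (requestData) {
            if (!pending[requestData.id]) {
                pending[requestData.id] = true;
                count += 1;
            }
        },
        // intended to be called from page.onResourceReceived;
        // only the 'end' stage marks a request as finished
        received: function (response) {
            if (response.stage === 'end' && pending[response.id]) {
                delete pending[response.id];
                count -= 1;
            }
        },
        pendingCount: function () {
            return count;
        },
        isIdle: function () {
            return count === 0;
        }
    };
}

// Wiring sketch (assumption; runs only inside PhantomJS):
// var page = require('webpage').create();
// var tracker = createResourceTracker();
// page.onResourceRequested = function (requestData) { tracker.requested(requestData); };
// page.onResourceReceived = function (response) { tracker.received(response); };
```

Note that isIdle() is also true before the first request starts, and a request that never completes would keep the tracker busy, so it is probably best combined with a polling check and a timeout rather than used alone.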