Does this have to be a Java solution? PhantomJS combined with pjscrape can crawl a web page and collect every URL on it.
All you need to do is create a JavaScript configuration file.
getlinks.js:
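pjs.addSuite({
    // page to scrape
    url: 'https://dev59.com/fGzXa4cB1Zd3GeqPXMGn',
    // keep pjscrape's jQuery under _pjs.$ so it does not clash with the page's own $
    noConflict: true,
    scraper: function() {
        var links = _pjs.$('a').map(function() {
            // convert relative URLs to absolute
            var link = _pjs.toFullUrl(_pjs.$(this).attr('href'));
            return link;
        });
        return links.toArray();
    }
});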
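pjs.config({
    // options: 'stdout', 'file' (set in config.logFile) or 'none'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});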
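Run it with the command phantomjs pjscrape.js getlinks.js
In this example the scraped items are written to a file (they could also be printed to the console).
Here is the (partial) output: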
* Suite 0 starting
* Opening https://dev59.com/fGzXa4cB1Zd3GeqPXMGn
* Scraping https://dev59.com/fGzXa4cB1Zd3GeqPXMGn
* Suite 0 complete
* Writing 145 items
["http://stackoverflow.com/users/login?returnurl=%2fquestions%2f14138297%2freplace-all-urls-in-a-html","http://careers.stackoverflow.com","http://chat.stackoverflow.com","http://meta.stackoverflow.com","http://stackoverflow.com/about","http://stackoverflow.com/faq","http://stackoverflow.com/","http://stackoverflow.com/questions","http://stackoverflow.com/tags","http://stackoverflow.com/users","http://stackoverflow.com/badges","http://stackoverflow.com/unanswered","http://stackoverflow.com/questions/ask", ...
"http://creativecommons.org/licenses/by-sa/3.0/","http://creativecommons.org/licenses/by-sa/3.0/","http://blog.stackoverflow.com/2009/06/attribution-required/"]
* Saved 145 items
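
The output above contains repeated entries (the Creative Commons link appears twice), so a small post-processing step may be useful. The following is only a sketch to run under Node.js: uniqueLinks is a hypothetical helper, not part of pjscrape, and it assumes the output file holds a single JSON array of URL strings like the one shown in the log.

```javascript
// Hypothetical helper, not part of pjscrape.
// Assumption: the scrape output is one JSON array of URL strings.
function uniqueLinks(jsonText) {
    var links = JSON.parse(jsonText);
    // Keep only string entries, in case some anchors had no usable href.
    var strings = links.filter(function (link) {
        return typeof link === 'string';
    });
    // A Set drops repeated values and preserves first-seen order.
    return Array.from(new Set(strings));
}

// Usage (assumes scrape_output.json is in the current directory):
// var fs = require('fs');
// var links = uniqueLinks(fs.readFileSync('scrape_output.json', 'utf8'));
// console.log(links.length + ' unique links');
```

Deduplicating after the scrape, rather than inside the scraper function, keeps getlinks.js minimal and leaves the raw output available for inspection.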