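As @jihoward mentioned, RSelenium would solve this problem without the need to inspect the network traffic or dissect the underlying website to find the appropriate calls. I would also note that RSelenium can run without Selenium Server if phantomjs is installed on the user's system; in that case RSelenium drives phantomjs directly. There is a vignette on headless browsing at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html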
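Inspecting web traffic with a browser
However, in this case inspecting the web traffic shows that the following JSON file is called: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0. It is not cookie protected or sensitive to the user-agent string, so it can be read directly: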
library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
[7] "http://www.calcharge.org/2014/07/"
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
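The extraction step above can be tried offline. The following is a minimal sketch, assuming the parsed response is a list with a `data` element whose records each carry a `url` field (as the `sapply` call suggests). The `res` object here is hand-built for illustration and the `title` field is made up; it is not the real response.

```r
# Hand-built stand-in for the parsed JSON response. Only the `data` list and
# the `url` field are taken from the answer; `title` is a hypothetical field.
res <- list(data = list(
  list(url = "http://insideevs.com/category/vw/", title = "example A"),
  list(url = "http://www.calcharge.org/2014/07/", title = "example B")
))

# vapply() is a stricter variant of sapply(): it raises an error if any
# record does not yield exactly one character string.
urls <- vapply(res$data, function(d) d[["url"]], character(1))
print(urls)
```

Using `vapply` instead of `sapply` is a design choice rather than a requirement: it fails loudly if a record is missing its `url`, instead of silently returning a list.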
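Inspecting web traffic by writing a simple function for phantomJS
With RSelenium and phantomJS we can also inspect the traffic on the fly. The following simple example records the requests made and the responses received by the current page and stores them in a file named "traffic.txt" in the current working directory: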
library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
var fs = require(\"fs\");
fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
page.onResourceRequested = function(request) {
fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
};
page.onResourceReceived = function(response) {
fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
};"
result <- remDr$phantomExecute(psScript)
remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"
> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
pJS$stop() # stop phantomJS
Here we can see that one of the received files was http://spiderbook.com/company/details/docs?rel=300795&docs_page=0.
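The filtering of the traffic log can be wrapped in a small function. This is only a sketch: `extractReceived` is a hypothetical helper (not part of RSelenium), and the sample lines are copied from the output shown above so that it runs without a phantomJS session.

```r
# Sample lines copied from the traffic.txt output shown above; in a live
# session they would come from readLines("traffic.txt").
urlTraffic <- c(
  "WEBSITE CALLS",
  "Request: http://spiderbook.com/company/17495/details?rel=300795",
  "Receive: http://spiderbook.com/company/17495/details?rel=300795",
  "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js",
  "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0",
  "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
)

# Hypothetical helper: keep the "Receive: " lines, strip the prefix, and
# return the distinct URLs that contain `pattern` as a literal substring.
extractReceived <- function(lines, pattern) {
  received <- lines[startsWith(lines, "Receive: ")]
  urls <- sub("^Receive: ", "", received)
  unique(urls[grepl(pattern, urls, fixed = TRUE)])
}

jsonUrls <- extractReceived(urlTraffic, "/company/details/docs")
print(jsonUrls)
```

Matching with `fixed = TRUE` avoids treating characters such as `?` and `.` in a URL as regular-expression syntax, and `unique()` collapses the repeated "Receive" entries for the same resource.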
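Using phantomJS/ghostdriver's built-in HAR support to inspect traffic
In fact, phantomJS/ghostdriver creates its own HAR files, so simply by browsing to the page while driving phantomJS we already have access to all the request/response data: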
library(RSelenium)
library(RJSONIO) # provides fromJSON(), used below to parse the HAR log
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))
> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
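So this last example interrogates the HAR file produced by phantomJS to find the XMLHttpRequest requests and returns the specific URL, which, as hoped, corresponds to the JSON file found at the beginning of the answer.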
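The header check on the HAR entries can also be written more explicitly. The sketch below assumes each request header is a name/value pair, as in the HAR format, and that the XHR calls carry an `X-Requested-With: XMLHttpRequest` header. The `harLogs` object is a hand-built stand-in with made-up `Accept` values, not real phantomJS output, and `isXHR` is a hypothetical helper.

```r
# Hand-built stand-in for the parsed HAR log; the URLs are taken from the
# answer, while the header contents are assumptions for illustration.
harLogs <- list(log = list(entries = list(
  list(request = list(
    url = "http://spiderbook.com/static/js/jquery-1.10.2.min.js",
    headers = list(list(name = "Accept", value = "*/*")))),
  list(request = list(
    url = "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0",
    headers = list(
      list(name = "Accept", value = "application/json"),
      list(name = "X-Requested-With", value = "XMLHttpRequest"))))
)))

# Hypothetical helper: TRUE if any request header marks the entry as an XHR.
isXHR <- function(entry) {
  any(vapply(entry$request$headers, function(h) {
    identical(tolower(h[["name"]]), "x-requested-with") &&
      identical(h[["value"]], "XMLHttpRequest")
  }, logical(1)))
}

# Keep only the XHR entries and pull out their request URLs.
xhrUrls <- vapply(Filter(isXHR, harLogs$log$entries),
                  function(e) e$request$url, character(1))
print(xhrUrls)
```

Checking the header name as well as its value is a little stricter than grepping every header value for "XMLHttpRequest", at the cost of missing requests that a site marks in some other way.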