Web scraping with R

4
I'm running into some problems scraping data from a website. First of all, I don't have much experience with web scraping... My plan is to use R to scrape some data from the following website: http://spiderbook.com/company/17495/details?rel=300795 In particular, I want to extract the links to the articles on that site.
My idea so far:
library(XML)  # needed for htmlParse/xpathApply
xmltext <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
sources <- xpathApply(xmltext, "//body//div")
sourcesChar <- sapply(sources, xmlValue)  # text of each div (this step was missing from the snippet)
sourcesCharSep <- lapply(sourcesChar, function(x) unlist(strsplit(x, " ")))
sourcesInd <- lapply(sourcesCharSep, function(x) grep('"(http://[^"]*)"', x))

But this doesn't deliver the intended information. Some help needed here! Thanks! Best, Christoph

First off, you need to put the url in quotes in the first line... - jlhoward
Right, that's it. I only pasted it in like that because it's part of a larger script that loops over several pages and URLs. - CKre
4 Answers

10

You picked a tough problem to learn on.

This site uses JavaScript to load the article information. In other words, the link loads a set of scripts that run when the page loads to grab the information (probably from a database) and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that. So the links you want are simply not present.
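
As a quick sanity check (a minimal sketch, assuming only the XML package), you can confirm that the article anchors are absent from the raw HTML:

library(XML)
raw <- htmlParse("http://spiderbook.com/company/17495/details?rel=300795")
# the scripts never ran, so the anchors the page would insert are missing
length(getNodeSet(raw, '//a[@class="doclink"]'))
# [1] 0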

As far as I know, the only way around this is to use the RSelenium package. That package essentially lets you pass the base HTML through what amounts to a browser simulator, which does run the scripts. The catch with RSelenium is that you need to download not only the package but also a "Selenium Server". This link has a nice introduction to RSelenium.

Once that is done, inspecting the source with a browser shows that the article links are all in the href attribute of anchor tags with class = doclink. That is straightforward to extract using XPath. Never, ever parse XML with regex.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer()        # download Selenium Server, if not already present
startServer()           # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open()            # open connection
remDr$navigate(url)     # grab and process the page (including scripts)
doc   <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"                                                                                    
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
# [7] "http://www.calcharge.org/2014/07/"                                                                                    
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"

Can you use your browser's network inspector to get the ajax URL the data is being fetched from? Then just GET it. - Spacedman
Could you please take a look at this question? https://stackoverflow.com/questions/66996370/r-error-in-f-x1l-y1l-scheme-not-supported-in-url-na Thank you! - stats_noob

9
As @jlhoward mentioned, RSelenium solves this problem without having to inspect the network traffic or dissect the underlying website to find the appropriate calls. I would add that RSelenium can run without a Selenium Server if phantomjs is installed on the user's system; in that case RSelenium can drive phantomjs directly. There is a vignette on headless browsing at http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
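
For completeness, here is a minimal sketch of solving the original problem that way, assuming phantomjs is installed and on the PATH; it reuses @jlhoward's XPath but needs no Selenium Server:

library(RSelenium)
library(XML)
pJS <- phantom()                               # drive phantomjs directly, no Selenium Server
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
remDr$navigate("http://spiderbook.com/company/17495/details?rel=300795")
doc <- htmlParse(remDr$getPageSource()[[1]])   # page source after the scripts have run
as.character(doc['//a[@class="doclink"]/@href'])
pJS$stop()                                     # shut phantomjs down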

Inspecting web traffic with a browser

In this case, however, inspecting the network traffic reveals a call to the following JSON file: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0, which is not cookie-protected or sensitive to user-agent strings and the like. In that case one can simply do:

library(RJSONIO)
res <- fromJSON("http://spiderbook.com/company/details/docs?rel=300795&docs_page=0")
> sapply(res$data, "[[", "url")
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"                                                                                    
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
[7] "http://www.calcharge.org/2014/07/"                                                                                    
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"      

Inspecting web traffic with a simple function for phantomJS

Using RSelenium and phantomJS, we can also inspect the traffic in real time while driving phantomJS. Here is a simple example in which we log the calls requested and received by the page we are currently browsing, storing them in a file "traffic.txt" in the current working directory:

library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
psScript <- "var page = this;
             var fs = require(\"fs\");
             fs.write(\"traffic.txt\", 'WEBSITE CALLS\\n', 'w');
             page.onResourceRequested = function(request) {
                fs.write(\"traffic.txt\", 'Request: ' + request.url + '\\n', 'a');
             };
             page.onResourceReceived = function(response) {
                fs.write(\"traffic.txt\", 'Receive: ' + response.url + '\\n', 'a');
             };"

result <- remDr$phantomExecute(psScript)

remDr$navigate(appUrl)
urlTraffic <- readLines("traffic.txt")
> head(urlTraffic)
[1] "WEBSITE CALLS"                                                        
[2] "Request: http://spiderbook.com/company/17495/details?rel=300795"      
[3] "Receive: http://spiderbook.com/company/17495/details?rel=300795"      
[4] "Request: http://spiderbook.com/static/js/jquery-1.10.2.min.js"        
[5] "Request: http://spiderbook.com/static/js/lib/jquery.dropkick-1.0.2.js"
[6] "Request: http://spiderbook.com/static/js/jquery.textfill.js"          

> urlTraffic[grepl("Receive: http://spiderbook.com/company/details", urlTraffic)]
[1] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"
[2] "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

pJS$stop() # stop phantomJS

Here we can see that one of the received files was "Receive: http://spiderbook.com/company/details/docs?rel=300795&docs_page=0".

Inspecting traffic using the HAR support built into phantomJS/ghostdriver

In fact, phantomJS/ghostdriver creates its own HAR files, so just by browsing pages while driving phantomJS we already have access to all of the request/response data:

library(RSelenium)
library(RJSONIO)  # for fromJSON below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantom")
remDr$open()
appUrl <- "http://spiderbook.com/company/17495/details?rel=300795"
remDr$navigate(appUrl)
harLogs <- remDr$log("har")[[1]]
harLogs <- fromJSON(harLogs$message)
# HAR files contain a lot of detail; here we just illustrate accessing the data
requestURLs <- sapply(lapply(harLogs$log$entries, "[[", "request"), "[[","url")
requestHeaders <- lapply(lapply(harLogs$log$entries, "[[", "request"), "[[", "headers")
XHRIndex <- which(grepl("XMLHttpRequest", sapply(requestHeaders, sapply, "[[", "value")))

> harLogs$log$entries[XHRIndex][[1]]$request$url
[1] "http://spiderbook.com/company/details/docs?rel=300795&docs_page=0"

So this final example shows how to query the HAR files produced by phantomJS for XMLHttpRequest requests, returning exactly the URL of the JSON file we set out to find at the beginning of this answer.


Could you please take a look at this question? https://stackoverflow.com/questions/66996370/r-error-in-f-x1l-y1l-scheme-not-supported-in-url-na Thank you! - stats_noob

2
The network inspector in any browser will tell you where it is getting its data from. In this case it appears to be fetching JSON from http://spiderbook.com/company/details/docs?rel=300795, which means it takes just two lines with jsonlite:
> require(jsonlite)
> x=fromJSON("http://spiderbook.com/company/details/docs?rel=300795")
> x$data$url
[1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
[2] "http://insideevs.com/category/vw/"                                                                                    
[3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"                                                             
[4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"     
[5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"                            
[6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"                                           
[7] "http://www.calcharge.org/2014/07/"                                                                                    
[8] "http://nl.anygator.com/search/volkswagen+winterbanden"                                                                
[9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"      

I'm guessing this bit of the JSON tells you whether there are more pages of returned data:
> x$has_next
[1] FALSE

And I suspect there is a parameter in the URL that fetches a given page of the data, as sketched below.
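
A sketch of how that paging might be automated; this is hypothetical in that it assumes the docs_page parameter seen in the other answers and that has_next signals further pages:

library(jsonlite)
base_url <- "http://spiderbook.com/company/details/docs?rel=300795&docs_page=%d"
page <- 0
urls <- character(0)
repeat {
  x <- fromJSON(sprintf(base_url, page))   # fetch one page of results
  urls <- c(urls, x$data$url)
  if (!isTRUE(x$has_next)) break           # stop when no further pages are reported
  page <- page + 1
}
urls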

How to get the JSON URL from the public URL? Not sure, since I don't know what the "17495" is doing in there...


0
library(rvest)
library(tidyverse)
library(janitor)

wiki_link2 <- "https://en.wikipedia.org/wiki/Economy_of_China"

wiki_page2 <- read_html(wiki_link2)

# the table of interest is the fourth table on the page
gdp_table2 <- wiki_page2 %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[4]]

View(gdp_table2)
gdp_table2 <- gdp_table2 %>% clean_names(case = "snake")
names(gdp_table2)

# strip the "%" signs and convert to numeric
# (the original referenced an undefined 'finaltable'; column names assumed to follow clean_names)
gdp_table2$inflation_rate_in_percent <- as.numeric(sub("%", "", gdp_table2$inflation_rate_in_percent))
gdp_table2$unemployment_in_percent <- as.numeric(sub("%", "", gdp_table2$unemployment_in_percent))



# Changing data into long form (melt/dcast come from reshape2;
# nifty_50 and goldsilver_long are assumed to already exist)
library(reshape2)
nifty_50_long <- melt(nifty_50, id.vars = "weightage")

# Long to wide
goldsilver_wide <- dcast(goldsilver_long, variable ~ year, value.var = "value")
View(goldsilver_wide)
