Web scraping across multiple pages in R

Question

8

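I'm working on a web scraper that searches for a specific wine and returns a list of local wines of that variety. My problem is search results that span multiple pages. The code below is a basic example of what I'm working with.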

library(rvest)

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

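This particular search returns 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I could handle every URL by hand, but that seems like overkill.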

- Jamie Leigh

2 Answers

9

You can lapply across a vector of URLs, which you can build by pasting the base URL onto a sequence of page numbers:

library(rvest)

wines <- lapply(paste0('http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=', 1:39),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".review-listing .title") %>% 
                        html_text()
                })

The result is returned as a list, with one element per page.

- alistaire

Awesome, Alistaire!! Can you explain how this works? Thanks. - ASH

2

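It pastes together a vector of URLs, one per page, and then lapply runs the function on each URL. The function is an rvest chain that reads the HTML at that URL, selects the nodes with the specified classes (i.e. the titles), and grabs the text inside those nodes. It returns one list item per run of the function, but if you want to collapse them all into a single vector, just run unlist(wines). If you also want other elements for each wine, you can assemble them all into a data frame. - alistaire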
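The two steps described in that comment can be tried without hitting the site. This is only a sketch: `base_url`, `urls`, `wines_demo`, and the wine names are hypothetical stand-ins, and the list mimics the shape of the scraped result (one character vector per page) rather than containing real data.

```r
# Step 1: build one URL per page by pasting page numbers onto the base URL.
base_url <- "http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page="
urls <- paste0(base_url, 1:3)   # first three pages only, for illustration

# Step 2: lapply() returns a list with one element per URL.
# Hypothetical stand-in for that list (made-up titles, not scraped data).
wines_demo <- list(
  c("Wine A", "Wine B"),
  c("Wine C", "Wine D"),
  c("Wine E")
)

# unlist() collapses the per-page vectors into a single character vector.
all_wines <- unlist(wines_demo)
length(all_wines)
```

With the real `wines` list from the answer, `unlist(wines)` would behave the same way, just with one element per scraped title.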

If I want to click "view full review" on each row, which opens a new page, do I need to use RSelenium? - Mostafa90

1

No. Each row is wrapped in an <a> tag whose href attribute is the URL of the review, so you can get a vector of URLs for further processing with something like page %>% html_nodes('a.review-listing') %>% html_attr('href'). - alistaire
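A minimal offline sketch of that suggestion follows. The HTML snippet and the example.com URLs are invented for illustration and only assume the structure described in the comment (each row is an `<a class="review-listing">` with an `href`); the live page may be laid out differently.

```r
library(rvest)

# Hypothetical markup mimicking the described structure, not the real page.
page <- read_html('
  <div>
    <a class="review-listing" href="http://example.com/review-1">
      <span class="title">Wine A</span>
    </a>
    <a class="review-listing" href="http://example.com/review-2">
      <span class="title">Wine B</span>
    </a>
  </div>')

# Select the row links and pull out their href attributes.
review_urls <- page %>%
  html_nodes("a.review-listing") %>%
  html_attr("href")

review_urls
```

Each element of `review_urls` could then be passed to `read_html()` in the same lapply pattern used for the result pages, so no browser automation is needed as long as the links are present in the static HTML.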

- hrbrmstr · Accepted Answer

If you want all of the information as a data.frame, you can use purrr::map_df() to do something like this:

library(rvest)
library(purrr)

# "+" encodes the space in the query; a literal "%20" would clash with sprintf()'s format syntax
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating      (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price       (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
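The glimpse() output shows rating and price stored as character columns. If numeric values are wanted for sorting or filtering, one possible follow-up is sketched below. `wines_demo` is a hypothetical stand-in built from the first two rating and price values shown above, not the full scraped data frame.

```r
# Stand-in data frame using the first two rating/price values from the
# glimpse() output; the real `wines` has 1,170 rows and 5 columns.
wines_demo <- data.frame(
  rating = c("96", "95"),
  price  = c("140", "70"),
  stringsAsFactors = FALSE
)

# Convert the character columns to numeric.
wines_demo$rating <- as.numeric(wines_demo$rating)
wines_demo$price  <- as.numeric(wines_demo$price)

str(wines_demo)
```

On the real data, any entry that is not a plain number would become NA with a warning, so it is worth checking for NA values after converting.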