Scraping an .asp website with R

I'm scraping http://www.progarchives.com/album.asp?id= and get a warning message:

Warning message:
XML content does not seem to be XML:
http://www.progarchives.com/album.asp?id=2
http://www.progarchives.com/album.asp?id=3 http://www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5

The scraper works for each page individually, but not for the range of URLs b1=2:b2=1000.
library(RCurl)
library(XML)

getUrls <- function(b1, b2){
  root <- "http://www.progarchives.com/album.asp?id="
  urls <- NULL
  for (bandid in b1:b2){
    urls <- c(urls, paste(root, bandid, sep = ""))
  }
  return(urls)
}

prog.arch.scraper <- function(url){
  SOURCE <- getUrls(b1 = 2, b2 = 1000)
  PARSED <- htmlParse(SOURCE)
  album <- xpathSApply(PARSED, "//h1[1]", xmlValue)
  date <- xpathSApply(PARSED, "//strong[1]", xmlValue)
  band <- xpathSApply(PARSED, "//h2[1]", xmlValue)
  return(c(band, album, date))
}

prog.arch.scraper(urls)
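(Editorial note: the warning arises because htmlParse() parses a single document, while getUrls() returns a whole character vector of URLs. A minimal sketch of a per-page loop, reusing the asker's own XPath queries; untested against the live site:)

```r
library(RCurl)
library(XML)

# Parse one page at a time instead of passing the whole URL vector to htmlParse()
scrape_one <- function(url) {
  parsed <- htmlParse(getURL(url))
  c(band  = xpathSApply(parsed, "//h2[1]", xmlValue),
    album = xpathSApply(parsed, "//h1[1]", xmlValue),
    date  = xpathSApply(parsed, "//strong[1]", xmlValue))
}

results <- lapply(getUrls(2, 10), scrape_one)
```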
2 Answers


Here's another approach, using rvest and dplyr:

library(rvest)
library(dplyr)
library(pbapply)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- html(sprintf(base_url, id))
  data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
             date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
             band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
             stringsAsFactors=FALSE)

}

albums <- bind_rows(pblapply(2:10, get_album_info))

head(albums)

## Source: local data frame [6 x 3]
## 
##                        album                           date      band
## 1                    FOXTROT Studio Album, released in 1972   Genesis
## 2              NURSERY CRYME Studio Album, released in 1971   Genesis
## 3               GENESIS LIVE         Live, released in 1973   Genesis
## 4        A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
## 6           GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz

I didn't want to hammer the site with lots of requests, so you can increase that sequence for your own use. pblapply gives you a free progress bar.

To be kind to the site (especially since it doesn't expressly prohibit scraping), you may want to add a Sys.sleep(10) at the end of the get_album_info function.
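(That suggestion, applied to the function above; 10 seconds is just the suggested delay, tune it to taste:)

```r
get_album_info <- function(id) {
  pg <- html(sprintf(base_url, id))
  res <- data.frame(album = pg %>% html_nodes(xpath = "//h1[1]") %>% html_text(),
                    date  = pg %>% html_nodes(xpath = "//strong[1]") %>% html_text(),
                    band  = pg %>% html_nodes(xpath = "//h2[1]") %>% html_text(),
                    stringsAsFactors = FALSE)
  Sys.sleep(10)  # pause between requests to be polite to the server
  res
}
```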

UPDATE

To deal with server errors (a 500 in this case, but it will work for other errors, too), you can use try:

library(rvest)
library(dplyr)
library(pbapply)
library(data.table)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- try(html(sprintf(base_url, id)), silent=TRUE)

  if (inherits(pg, "try-error")) {
    data.frame(album=character(0), date=character(0), band=character(0))
  } else {
    data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
               date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
               band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
               stringsAsFactors=FALSE)
  }

}

albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))

##                       album                           date         band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
## 2:    THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
## 3:             AD INFINITUM Studio Album, released in 1998 Ad Infinitum

You won't get any entries for the error pages (in this case it only returns entries for ids 9, 10 and 30).

Thanks! It worked, except I got an error message saying there was no function "bind_rows". I reinstalled all the packages, but still no luck. - monarque13
rbindlist solved it. I've been meaning to learn rvest, so your code helped me get into it in more detail. Thanks @hrbrmstr. One more question, though: what does sprintf actually do inside the html function? - monarque13
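(For reference: sprintf() just substitutes the id into the %s placeholder of base_url, building the full URL that html() then fetches:)

```r
base_url <- "http://www.progarchives.com/album.asp?id=%s"
sprintf(base_url, 2)
# [1] "http://www.progarchives.com/album.asp?id=2"
```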
I have about 48,000 pages to scrape, and I noticed the scraper stops when it hits a broken page, i.e. an "internal error". One way to handle them would be to check each page, record which ones are broken, and concatenate the good ones in the albums object, but that's time-consuming. Any suggestions for dealing with the broken pages? Thanks. - monarque13
When the page sequence includes a broken page, I get "Error in parse.response(r, parser, encoding = encoding) : server error: (500) Internal Server Error". The problem is there's no way to know in advance which pages are bad. So far I've identified pages 23, 28, 29, 34, 44, 86, 134, 165, 188, 252, 332, 350, 351, 377, 378, 531, 688, 758, 816, 818, 876, 886 through 889, 937, 960, 961, 976, 1002, 1054, 1084, 1103, 1116, among others, as broken. There are probably hundreds more. See http://www.progarchives.com/album.asp?id=2347 for an example. - monarque13
Updated the answer to handle server errors. Also added rbindlist. - hrbrmstr
You're a legend. Thank you so much! - monarque13


Rather than using xpathApply(), you could subset the first node in each path and call xmlValue() on it. Here's the code I came up with:

library(XML)
library(RCurl) 

## define the urls and xpath queries
urls <- sprintf("http://www.progarchives.com/album.asp?id=%s", 2:10)
path <- c(album = "//h1", date = "//strong", band = "//h2")

## define a re-usable curl handle for the c-level nodes
curl <- getCurlHandle()
## allocate the result list
out <- vector("list", length(urls))

## do the work    
for(u in urls) {
    content <- getURL(u, curl = curl)
    doc <- htmlParse(content, useInternalNodes = TRUE)
    out[[u]] <- lapply(path, function(x) xmlValue(doc[x][[1]]))
    free(doc)
}

## structure the result
data.table::rbindlist(out)
#                         album                           date      band
# 1:                    FOXTROT Studio Album, released in 1972   Genesis
# 2:              NURSERY CRYME Studio Album, released in 1971   Genesis
# 3:               GENESIS LIVE         Live, released in 1973   Genesis
# 4:        A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
# 5: FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
# 6:           GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz
# 7:          GULLIBLES TRAVELS Studio Album, released in 1985 Abel Ganz
# 8:   THE DANGERS OF STRANGERS Studio Album, released in 1988 Abel Ganz
# 9:      THE DEAFENING SILENCE Studio Album, released in 1994 Abel Ganz

Update: to handle queries for ids that don't exist, we can write a condition with RCurl::url.exists() that handles the bad queries. So the function getAlbums() below returns a character vector of either the fetched xml values or NA, depending on the status of the url. You can change that if you like, of course. That was just a method that came to mind in the wee hours.

getAlbums <- function(url, id = numeric(), xPath = list()) {
    urls <- sprintf("%s?id=%d", url, id)
    curl <- getCurlHandle()
    out <- vector("list", length(urls))
    for(u in urls) {
        out[[u]] <- if(url.exists(u)) {
            content <- getURL(u, curl = curl)
            doc <- htmlParse(content, useInternalNodes = TRUE)
            lapply(xPath, function(x) xmlValue(doc[x][[1]]))
        } else {
            warning(sprintf("returning 'NA' for urls[%d] ", id[urls == u]))
            structure(as.list(xPath[NA]), names = names(xPath))
        }
        if(exists("doc")) free(doc)
    }
    data.table::rbindlist(out)
}

url <- "http://www.progarchives.com/album.asp"
id <- c(9:10, 23, 28, 29, 30)
path <- c(album = "//h1", date = "//strong", band = "//h2")
getAlbums(url, id, path)
#                       album                           date         band
# 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
# 2:    THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
# 3:                       NA                             NA           NA
# 4:                       NA                             NA           NA
# 5:                       NA                             NA           NA
# 6:             AD INFINITUM Studio Album, released in 1998 Ad Infinitum
# 
# Warning messages:
# 1: In getAlbums(url, id, path) : returning 'NA' for urls[23] 
# 2: In getAlbums(url, id, path) : returning 'NA' for urls[28] 
# 3: In getAlbums(url, id, path) : returning 'NA' for urls[29]  

@Richard Scriven Thanks! This works great, except I ran into the same problem as above with the dead links. - monarque13

Content provided by Stack Overflow.