Scraping text and tables from multiple pages with rvest and combining them

3

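I need to scrape several tables that live at different URLs. I have managed to scrape a single page, but my function fails when I try to scrape across pages and stack the tables into a data frame/list.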

library(rvest)
library(tidyverse)
library(purrr)

index <- 225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  urls %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>%
    html_table() %>%
    as.data.frame() %>% add_column(newcol=str_c(temp))
}
# results <- map_df(urls, get_gram)  I have commented this out, but this is what I
# used to get the table when the index had just one element, and it worked.

results <- list()
results[[i]] <- map_df(urls,get_gram)

I think I am stumbling at the step of stacking the map_df outputs. Thanks in advance for your help!


Hint: don't read the same web page more than once! At the start of the function use page <- url %>% read_html(), then parse "page" to get the information you need. - Dave2e
2 Answers

2
You are passing url to the function but using urls in the function body. Try this version:
library(rvest)
library(dplyr)
library(tibble)  # provides add_column()

index <-225:227
urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", index)

get_gram <- function(url){
  webpage <- url %>%  read_html() 
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/a[2]') %>%
    html_text() -> temp
  webpage %>%
    html_nodes(xpath = '//*[@id="block-zircon-content"]/table') %>% 
    html_table() %>% 
    as.data.frame() %>% add_column(newcol=temp)
}
result <- purrr::map_df(urls,get_gram)
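
To see why the url/urls mix-up matters, here is a toy sketch that needs no network access. The strings and the two helper functions are hypothetical stand-ins, not part of the original code. The function that reads the global vector receives every URL on each call, which is most likely why a call such as read_html(), which expects a single URL, broke once index had more than one element.

```r
library(purrr)

# Hypothetical stand-ins for the real URLs
urls <- c("page/225", "page/226", "page/227")

# Ignores its argument and reads the global vector instead
uses_global <- function(url) toupper(urls)

# Works on the single element that map() passes in
uses_argument <- function(url) toupper(url)

# Number of values each call returns, one entry per element of urls
global_lengths   <- lengths(map(urls, uses_global))
argument_lengths <- lengths(map(urls, uses_argument))

print(global_lengths)
print(argument_lengths)
```

With a single URL in urls both versions behave the same, which would explain why the original code worked for one page.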


Dear Ronak, there are a few empty tables, so the code fails. Is there a workaround? I tried using the "possibly" function as follows: result <- purrr::map_df(urls, possibly(get_gram, otherwise = NULL)) - Rajesh Patrick Que
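
One possible way to deal with pages that have no table is to test for the missing node explicitly and return NULL, rather than relying on an error being caught. The sketch below is only an illustration under assumptions: the inline HTML strings, the id "c" and the parse_page() helper are hypothetical stand-ins for the real pages, and the real site may fail for a different reason (for example, column types that differ between pages).

```r
library(rvest)
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical inline pages: the first has a table, the second does not
pages <- list(
  read_html('<div id="c"><a>x</a><a>Place A</a><table><tr><th>Ward</th><th>Member</th></tr><tr><td>1</td><td>BABY P</td></tr></table></div>'),
  read_html('<div id="c"><a>x</a><a>Place B</a></div>')
)

parse_page <- function(page) {
  title <- page %>% html_node(xpath = '//*[@id="c"]/a[2]') %>% html_text()
  tbl_node <- page %>% html_node(xpath = '//*[@id="c"]/table')
  # html_node() returns an "xml_missing" object when nothing matches
  if (inherits(tbl_node, "xml_missing")) return(NULL)
  tbl_node %>% html_table() %>% add_column(newcol = title)
}

# bind_rows() silently drops the NULL returned for the page without a table
result <- bind_rows(map(pages, parse_page))
print(result)
```

For the real URLs, each page would first be read once with read_html(url) and then passed to a function of this shape.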

2
Consider this approach. We only need html_node, since your code suggests there is just one table to scrape per page.
library(tidyverse)
library(rvest)

get_title <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="block-zircon-content"]/table') %>% html_table()

urls <- paste0("https://lsgkerala.gov.in/en/lbelection/electdmemberdet/2010/", 225:227)

tibble(urls) %>% 
  mutate(
    page = map(urls, read_html), 
    newcol = map_chr(page, get_title), 
    data = map(page, get_table), 
    page = NULL, urls = NULL
  ) %>% 
  unnest(data)

Output

# A tibble: 52 x 7
   newcol                                           `Ward No.` `Ward Name`      `Elected Members` Role      Party  Reservation
   <chr>                                                 <int> <chr>            <chr>             <chr>     <chr>  <chr>      
 1 Thiruvananthapuram - Chemmaruthy Grama Panchayat          1 VANDIPPURA       BABY P            Member    CPI(M) Woman      
 2 Thiruvananthapuram - Chemmaruthy Grama Panchayat          2 PALAYAMKUNNU     SREELATHA D       Member    INC    Woman      
 3 Thiruvananthapuram - Chemmaruthy Grama Panchayat          3 KOVOOR           KAVITHA V         Member    INC    Woman      
 4 Thiruvananthapuram - Chemmaruthy Grama Panchayat          4 SIVAPURAM        ANIL. V           Member    INC    General    
 5 Thiruvananthapuram - Chemmaruthy Grama Panchayat          5 MUTHANA          JAYALEKSHMI S     Member    INC    Woman      
 6 Thiruvananthapuram - Chemmaruthy Grama Panchayat          6 MAVINMOODU       S SASIKALA NATH   Member    CPI(M) Woman      
 7 Thiruvananthapuram - Chemmaruthy Grama Panchayat          7 NJEKKADU         P.MANILAL         Member    INC    General    
 8 Thiruvananthapuram - Chemmaruthy Grama Panchayat          8 CHEMMARUTHY      SASEENDRA         President INC    Woman      
 9 Thiruvananthapuram - Chemmaruthy Grama Panchayat          9 PANCHAYAT OFFICE PRASANTH PANAYARA Member    INC    General    
10 Thiruvananthapuram - Chemmaruthy Grama Panchayat         10 VALIYAVILA       SANJAYAN S        Member    INC    General    
# ... with 42 more rows

Dear ekoam, I ran into an empty table and the code failed. So far, my attempts at a workaround using possibly have not succeeded. - Rajesh Patrick Que
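
A hedged sketch of how possibly might be applied inside this pipeline: wrap only the table extractor, so that a page without a table yields NULL, and let unnest() drop that row. The inline HTML strings, the id "c" and the name safe_table are hypothetical, and this assumes the failure on the real site comes from html_table() being called on a missing node.

```r
library(tidyverse)
library(rvest)

# Hypothetical inline pages standing in for the real URLs
html <- c(
  '<div id="c"><a>x</a><a>Place A</a><table><tr><th>Ward</th></tr><tr><td>1</td></tr><tr><td>2</td></tr></table></div>',
  '<div id="c"><a>x</a><a>Place B</a></div>'
)

get_title <- . %>% html_node(xpath = '//*[@id="c"]/a[2]') %>% html_text()
get_table <- . %>% html_node(xpath = '//*[@id="c"]/table') %>% html_table()

# Return NULL instead of stopping when the table cannot be extracted
safe_table <- possibly(get_table, otherwise = NULL)

out <- tibble(html) %>%
  mutate(
    page = map(html, read_html),
    newcol = map_chr(page, get_title),
    data = map(page, safe_table),
    page = NULL, html = NULL
  ) %>%
  unnest(data)  # rows whose data is NULL are dropped by default

print(out)
```

If the pages without tables should stay in the result, unnest(data, keep_empty = TRUE) keeps them with NA in the table columns.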
