可以使用URL中的最后一个参数
&start=
来逐页迭代搜索结果页面。每页呈现25个项目,所以页面序列是25、50、75、100...。
我们将获取前5页的结果,总共125个交易。由于第一页从
&start=0
开始,因此我们分配一个向量
startRows
来表示每个页面的起始行。
然后,我们使用该向量驱动
lapply()
和一个匿名函数,该函数读取数据并操作以删除每个数据页面的标题行。
library(rvest)
library(dplyr)
start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
startRows <- c(0,25,50,75,100)
pages <- lapply(startRows,function(x){
url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date,"&EndDate=", end_date,
"&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=",x)
webpage <- xml2::read_html(url)
data <- webpage %>%
rvest::html_node(css = css_selector) %>%
rvest::html_table() %>%
as_tibble()
colnames(data) = data[1,]
data[-1, ]
})
data <- do.call(rbind,pages)
head(data,n=10)
...并输出:
> head(data,n=10)
# A tibble: 10 x 5
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1947-08… Bombers … "" "• Jack Underman" fractured legs (in auto accide…
2 1948-02… Bullets … "• Harry Jeannette… "" broken rib (DTD) (date approxi…
3 1949-03… Capitols "" "• Horace McKinney /… personal reasons (DTD)
4 1949-11… Capitols "" "• Fred Scolari" fractured right cheekbone (out…
5 1949-12… Knicks "" "• Vince Boryla" mumps (out ~2 weeks)
6 1950-01… Knicks "• Vince Boryla" "" returned to lineup (date appro…
7 1950-10… Knicks "" "• Goebel Ritter / T… bruised ligaments in left ankl…
8 1950-11… Warriors "" "• Andy Phillip" lacerated foot (DTD)
9 1950-12… Celtics "" "• Andy Duncan (a)" fractured kneecap (out indefin…
10 1951-12… Bullets "" "• Don Barksdale" placed on IL
>
验证结果
我们可以通过打印每个页面的第一行和最后一行来验证结果,从第一页上的最后一个观察值开始。
data[c(25,26,50,51,75,76,100,101,125),]
...并且输出结果与手动在网站上浏览搜索结果的页面1-5中呈现的内容相匹配。
> data[c(25,26,50,51,75,76,100,101,125),]
Date Team Acquired Relinquished Notes
<chr> <chr> <chr> <chr> <chr>
1 1960-01-… Celtics "" "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics "" "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks "• Art Heyma… "" returned to lineup
4 1964-12-… Hawks "• Bob Petti… "" returned to lineup (date approximate)
5 1968-11-… Nets (ABA) "" "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… "" "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers "" "• Earnie Kill… dislocated left foot (out indefinitely)
8 1970-10-… Lakers "" "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers "• Austin Ca… "" returned to lineup
如果我们查看表格中的最后一页,我们会发现页面系列的最大值是
&start=61475
。生成整个页面序列(2460页,与网站上搜索结果中列出的页面数相匹配)的R代码如下:
pages <- c(0,seq(from=25,to=61475,by=25))
...和输出:
> head(pages)
[1] 0 25 50 75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475