I have a data frame that looks like this:
country <- c("US", "Canada", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
  country                                        url
1      US http://en.wikipedia.org/wiki/United_States
2  Canada        http://en.wikipedia.org/wiki/Canada
3   Japan         http://en.wikipedia.org/wiki/Japan
4   China         http://en.wikipedia.org/wiki/China
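Using rvest, I want to scrape the table of contents from each URL and bind the results into a single output. The following code extracts the table of contents for one URL: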
library(rvest)
# read_html() replaces the deprecated html(); url[1] selects a single page
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()
Expected output:
country  toc
US       Etymology
         History
         Native American and European contact
         Settlements
         ...
Canada   Etymology
         History
         Aboriginal peoples
         European colonization
         ...etc
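
For reference, here is a minimal sketch of the shape I have in mind, not a tested solution. The helper names `get_toc` and `bind_tocs` are hypothetical (they are not part of rvest), and the sketch assumes each page still marks its table-of-contents entries with the `.toctext` class. Unlike the expected output above, it repeats the country on every row rather than leaving the cell blank.

```r
library(rvest)

# Hypothetical helper: fetch one page and return its TOC entries
# as a character vector (assumes the ".toctext" selector still matches).
get_toc <- function(page_url) {
  read_html(page_url) %>%
    html_nodes(".toctext") %>%
    html_text()
}

# Hypothetical helper: stack per-country TOC vectors into one long
# data frame with one row per TOC entry.
bind_tocs <- function(countries, tocs) {
  pieces <- Map(function(country, toc) {
    data.frame(country = rep(country, length(toc)), toc = toc,
               stringsAsFactors = FALSE)
  }, countries, tocs)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Usage (needs network access, so left commented out):
# tocs <- lapply(as.character(df$url), get_toc)
# result <- bind_tocs(as.character(df$country), tocs)
```

Keeping the fetching step separate from the binding step means the binding can be checked on hand-made vectors without hitting Wikipedia.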