I have the following function, which lets me fetch the content of a Wikipedia article from its URL (the exact content does not matter for this question):
getPageContent <- function(url) {
  library(rvest)
  library(magrittr)
  # Note: html() is deprecated in newer rvest versions; read_html() replaces it.
  pc <- html(url) %>%
    html_node("#mw-content-text") %>%
    # strip tags
    html_text() %>%
    # concatenate vector of texts into one string
    paste(collapse = "")
  pc
}
The function works when called with a single URL:
getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")
[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)
However, if I pass the function to dplyr to fetch the content of several pages at once, I get an error:
example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE)
library(dplyr)
example <- mutate(example, content = getPageContent(url))
Error: length(url) == 1 ist nicht TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
the condition has length > 1 and only the first element will be used
Judging from the error message, I believe the problem is that getPageContent
cannot handle a vector of URLs, but I don't know how to fix this.
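To illustrate the failure mode without hitting the network: `mutate()` passes the whole `url` column to the function in one call, while `html(url)` asserts `length(url) == 1`. The sketch below uses a made-up scalar-only function, `first_word` (not the question's code), that has the same guard, and shows that applying it element-wise avoids the error:

```r
# A made-up scalar-only function that, like getPageContent, rejects vectors:
first_word <- function(s) {
  stopifnot(length(s) == 1)      # the same kind of guard that mutate() trips
  strsplit(s, " ")[[1]][1]
}

urls <- c("alpha beta", "gamma delta")
# first_word(urls)               # Error: length(s) == 1 is not TRUE

# Applying it one element at a time works:
vapply(urls, first_word, character(1), USE.NAMES = FALSE)
# [1] "alpha" "gamma"
```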
++++
EDIT: both proposed solutions, 1) using rowwise()
and 2) using sapply(),
work well. Simulating with 10 random WP articles, the second approach is about 25% faster:
> system.time(
+ example <- example %>%
+ rowwise() %>%
+ mutate(content = getPageContent(url))
+ )
User System verstrichen
0.39 0.14 1.21
>
>
> system.time(
+ example$content <- unlist(lapply(example$url, getPageContent))
+ )
User System verstrichen
0.49 0.11 0.90
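A third base-R option (a sketch, not benchmarked here) is to wrap the scalar function with `Vectorize()`, which uses `mapply()` under the hood to do the element-wise looping. A self-contained toy stand-in for getPageContent:

```r
# Vectorize() turns a scalar-only function into one that accepts vectors.
# Toy stand-in that, like getPageContent, insists on length-1 input:
scalar_only <- function(x) {
  stopifnot(length(x) == 1)
  x * x
}

vec_ok <- Vectorize(scalar_only)
vec_ok(1:3)   # 1 4 9, no error

# Applied to the question's function (untested sketch, needs network access;
# USE.NAMES = FALSE keeps the URLs from being attached as names):
# getPageContentV <- Vectorize(getPageContent, USE.NAMES = FALSE)
# example <- mutate(example, content = getPageContentV(url))
```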