Passing a vector to a custom function used with dplyr::mutate

3
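I have the following function, which lets me fetch the content of a Wikipedia page from its URL (the exact content is irrelevant to this question).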
getPageContent <- function(url) {

        library(rvest)
        library(magrittr)

        # note: html() has been superseded by read_html() in newer rvest releases
        pc <- html(url) %>% 
                html_node("#mw-content-text") %>% 
                # strip tags
                html_text() %>%
                # concatenate vector of texts into one string
                paste(collapse = "")

        pc
}

The function works when it is called with a single URL:

getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")

[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)

However, if I use the function inside dplyr to fetch the content of several pages, I get an error:

example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
                      )

library(dplyr)
example <- mutate(example, content = getPageContent(url))

Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
  the condition has length > 1 and only the first element will be used

Judging from the error message, I assume the problem is that getPageContent cannot handle a vector of URLs, but I don't know how to fix it.

++++

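Edit: Both proposed solutions, 1) rowwise() and 2) sapply(), work well. In a test with 10 random Wikipedia articles, the second approach (timed here as unlist(lapply())) was about 25% faster in elapsed time: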

> system.time(
+         example <- example %>% 
+                 rowwise() %>% 
+                 mutate(content = getPageContent(url)) 
+ )
   user  system elapsed 
   0.39    0.14    1.21 
> 
> 
> system.time(
+         example$content <- unlist(lapply(example$url, getPageContent))
+ )
   user  system elapsed 
   0.49    0.11    0.90 
2 Answers

10

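You can use rowwise(). It makes mutate() evaluate the expression one row at a time, so getPageContent() receives a single URL on each call: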

res <- example %>% 
        rowwise() %>% 
        mutate(content = getPageContent(url))
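
To see why a per-element call is needed, here is a minimal offline sketch. It assumes a hypothetical stand-in, fake_page_content(), that rejects anything but a single URL (as getPageContent does) and makes no network request. It also shows base R's Vectorize() as one possible alternative to rowwise(); this is an illustration, not part of the original answer.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL
# and returns a placeholder string instead of downloading the page.
fake_page_content <- function(url) {
        stopifnot(length(url) == 1)
        paste("content of", basename(url))
}

urls <- c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Tifama_chera")

# A plain mutate() hands the whole column to the function at once,
# which is equivalent to this call and fails the length check.
whole_vector <- tryCatch(fake_page_content(urls),
                         error = function(e) "error")

# Vectorize() builds a wrapper that calls the function once per element.
fake_page_content_v <- Vectorize(fake_page_content, USE.NAMES = FALSE)
per_element <- fake_page_content_v(urls)
```

A wrapper built this way accepts a whole column, so it could be used in a plain mutate() call without rowwise().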

2

Rather than passing a vector of strings to a function that expects a single string, why not call lapply() over the vector of URLs?

urls = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
         "https://en.wikipedia.org/wiki/Koncerthuset",
         "https://en.wikipedia.org/wiki/Tifama_chera",
         "https://en.wikipedia.org/wiki/Difference_theory")

Then:

content <- lapply(urls, getPageContent)

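Use lapply() if you want a list back. Alternatively, if your URLs are already in a data frame and you want to add the content as a new column, use sapply(), which here returns a character vector rather than a list: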

example$contents <- sapply(example$url, getPageContent)
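
One caveat: sapply() names each result after its URL and silently falls back to a list if any call returns something other than a single value. A stricter variant is vapply(), which declares the expected result type up front. The sketch below assumes a hypothetical offline stand-in, fake_page_content(), in place of getPageContent so that it runs without network access.

```r
# Hypothetical stand-in for getPageContent(): accepts exactly one URL
# and returns a placeholder string instead of downloading the page.
fake_page_content <- function(url) {
        stopifnot(length(url) == 1)
        paste("content of", basename(url))
}

example <- data.frame(url = c("https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera"),
                      stringsAsFactors = FALSE)

# character(1) states that every call must return exactly one string;
# vapply() raises an error otherwise. USE.NAMES = FALSE drops the URL names.
example$contents <- vapply(example$url, fake_page_content,
                           character(1), USE.NAMES = FALSE)
```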
