R: Difference between apply and do.call

Question

7

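I was just reading @David Arenburg's profile and found several useful tips on developing good R programming skills and habits. One of them caught my attention in particular. I had always thought that the apply functions were the cornerstone of working with data frames in R, but he writes: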

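If you are working with data frames, forget that there is a function called apply; whatever you do, don't use it. Especially with a margin of 1 (the only good use case for this function is operating over matrix columns, i.e. a margin of 2).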

Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans and so on, the excellent matrixStats package (for matrices), ?rowsum, and many more.

Could someone explain this? Why are the apply functions frowned upon?

- Helen

9

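I was actually talking about apply, not the whole *apply family. The main problem with apply is that it converts the entire data set to a matrix, which damages the data (a matrix cannot hold columns of different classes the way a data frame can) and leads to unexpected results. So when operating on columns, it is better to use other members of the *apply family, such as lapply or sapply. On the other hand, because R is a vectorized language, using apply with a margin of 1 is slow (independently of the matrix issue), so I recommend vectorized alternatives instead. - David Arenburg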
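
The coercion described in this comment can be seen in a minimal sketch. The data frame `df` and its column names are hypothetical, made up only for illustration; `stringsAsFactors = FALSE` is set so the result does not depend on the R version.

```r
# Hypothetical mixed-type data frame (not from the thread)
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)

# apply() converts df to a matrix first; because one column is character,
# every column reaches the function as a character vector
via_apply <- apply(df, 2, class)

# sapply() walks df as a list of columns, so each column keeps its own type
via_sapply <- sapply(df, class)

print(via_apply)
print(via_sapply)
```

Comparing the two printed results shows that only the list-based call still sees the integer and numeric columns as numbers.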

1

Aha, I see. Thank you very much for the explanation! - Helen

Also, this post about the *apply family is a useful read. Link - David Arenburg

Great! Thanks again :) - Helen

3 Answers

2

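I think what the author means is that you should use pre-built/vectorized functions where you can (because it is easier) and avoid apply (because it is essentially a loop and takes longer):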

```
library(microbenchmark)

d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))

# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!

# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))

# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))

# --------------
# Unit: microseconds
#                                  expr     min    lq     mean  median      uq     max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
#
#                                  expr    min      lq    mean median       uq     max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
```

- r.user.05apr

Are all the apply functions essentially loops? For example lapply, sapply, and so on? - Helen

How does this answer address the do.call part of the question? - pogibas

Edited. (Regarding for loops: according to https://www.burns-stat.com/pages/Tutor/R_inferno.pdf, using the apply functions is loop hiding.) - r.user.05apr

Could you add the microbenchmark output to your answer? - Tung

1

@Erosennin - Yes, the apply family are loops. Consider reading this question by @DavidArunberg: https://dev59.com/yV4b5IYBdhLWcg3wiSPV. - Parfait

1

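This has to do with how R stores matrices and data frames. As you know, a data.frame is a list of vectors; that is, each column of a data.frame is a vector. R is a vectorized language, so it is best to operate on whole vectors, which is why using apply with margin 1 (rows) is discouraged: instead of operating on one vector, each iteration cuts across several different vectors.

As far as I know, using apply with margin 2 (columns) is not much different from using do.call, although the latter may allow more flexibility.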
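
To make the "list of vectors" point concrete, here is a minimal sketch. The data frame `d` and its values are hypothetical, chosen only for illustration; the functions used are all base R.

```r
# Hypothetical all-numeric data frame (not from the thread)
d <- data.frame(a = c(3, 1, 4), b = c(2, 7, 1), c = c(5, 0, 9))

# A data frame is a list whose elements are the column vectors
stopifnot(is.list(d))

# sapply() hands one whole column vector to mean() per iteration
col_means <- sapply(d, mean)

# do.call() passes all columns as separate arguments in one call,
# here equivalent to pmin(d$a, d$b, d$c), which is vectorized over rows
row_min_vec <- do.call(pmin, d)

# apply() with MARGIN = 1 reaches the same numbers by looping over rows
row_min_loop <- apply(d, 1, min)

print(row_min_vec)
print(row_min_loop)
```

The do.call form works because the function receives complete columns, so the row-wise minimum is computed in a single vectorized call rather than once per row.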

*This information should be somewhere in the manuals.

- Novice

- G. Grothendieck · Accepted Answer

apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF is a mix of strings and numbers, each row is converted to a character vector before it is passed to f, so that, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains only numeric elements: the row has already been converted to character, and you cannot sum character strings. On the other hand, apply(iris[-5], 1, sum) will work the same as rowSums(iris[-5]).
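
The two iris calls above can be checked with a short sketch. The `tryCatch` wrapper and the variable names `res` and `ok` are additions for illustration only; they just let the failing call be captured instead of stopping the script.

```r
# Row-wise sum over the full data frame: the factor column forces every row
# to character before f sees it, so sum() raises an error
res <- tryCatch(
  apply(iris, 1, function(x) sum(x[-5])),
  error = function(e) "error"
)

# Dropping the non-numeric column before the call avoids the coercion
ok <- apply(iris[-5], 1, sum)

print(res)
print(head(ok))
```
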
If f produces a vector, the result is a matrix rather than another data frame; the result is also the transpose of what you might expect. This
```
apply(BOD, 1, identity)
```
gives the following rather than giving BOD back:
```
       [,1] [,2] [,3] [,4] [,5] [,6]
Time    1.0  2.0    3    4  5.0  7.0
demand  8.3 10.3   19   16 15.6 19.8
```
Many years ago Hadley Wickham did post iapply which is idempotent in the sense that iapply(mat, 1, identity) returns mat, rather than t(mat), where mat is a matrix. More recently with his plyr package one can write:
```
library(plyr)
ddply(BOD, 1, identity)
```
and get BOD back as a data frame.

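On the other hand, apply(BOD, 1, sum) gives the same result as rowSums(BOD), and apply(BOD, 1, f) can be useful for functions f that produce a scalar and have no counterpart in the way that sum has rowSums. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; it is not pretty, but it works.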
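
Both cases can be sketched with the built-in BOD data set. The row functions used here (range width, and each row as a share of its total) are hypothetical examples chosen for illustration, not functions from the answer.

```r
# Scalar-valued f with no direct row* counterpart: width of each row's range
spread <- apply(BOD, 1, function(r) diff(range(r)))

# Vector-valued f: each row expressed as a share of its own total.
# apply() returns one column per original row, so the result is 2 x 6
shares <- apply(BOD, 1, function(r) r / sum(r))

# Transposing restores the original orientation (6 rows, 2 columns),
# although the result is still a matrix and not a data frame
shares_t <- t(shares)

print(dim(shares))
print(dim(shares_t))
```
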