R: How to speed up a loop over array dimensions

Question

Tags: r, performance, for-loop, multidimensional-array

3

I am working with an array that has the following dimensions:

[1] 290 259  55   4

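For every combination of the last three dimensions, I need to take a rolling mean over the 290 elements of the first dimension, which reduces the number of elements to 289. Finally, I need to build a data frame containing the updated values.

The following code does what I need, but it takes a very long time to run (in fact, I had to interrupt it before it finished).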

library(zoo)

# Generate placeholder data with the same dimensions as mine
my.array <- array(1:16524200, dim=c(290,259,55,4))

# Get dimension sizes
dim2 <- dim(my.array)[2]
dim3 <- dim(my.array)[3]
dim4 <- dim(my.array)[4]

# Start with an empty data frame that is grown inside the loop
df2 <- data.frame()

# Loop over dimensions
for (i in 1:dim4) {
  for (j in 1:dim3) {
    for (k in 1:dim2) {

      # Take rolling average
      u <- rollapply(my.array[,k,j,i], 2, mean)

      # Assemble data frame
      df1 <- data.frame(time=i, level=j, lat=k, wind=u)
      df2 <- rbind(df2, df1)

    }
  }
}
# Very slow, and uses only one machine core

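I suspect that vectorization or some form of parallel processing could improve the run time of this code, but I haven't been able to work out how.

Any suggestions for making this code more efficient?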

- thiagoveloso

3

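Don't build data frames iteratively. Every time you call rbind, it copies the entire data frame into a new object and overwrites df2. That may work for a few dozen data points, but (as you've seen) it doesn't scale. - r2evans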

@r2evans, that makes sense, but... what is the alternative? - thiagoveloso

1

Generally, something <- lapply(list_of_stuff, somefunc) followed by do.call(rbind, something) (though this question needs a bit more than that). - r2evans

3 Answers

5

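For starters, you are in the second circle of the R Inferno (https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): growing objects. Every time you call rbind, it copies the whole frame, does the row-bind, and then overwrites the original variable name with that complete copy. So while the slowdown may not be noticeable over the first few dozen iterations, it gets a bit slower around 100 or so... and you are doing it 56,980 times.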
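
As a rough illustration of the object-growth problem, the sketch below builds the same small data frame two ways: by growing it with rbind inside a loop, and by collecting the pieces in a list that is bound once at the end. The helper `make_piece` and the size `n_pieces` are made up for this example; wrap either half in system.time() to compare costs on your own machine.

```r
# Hypothetical helper: one small piece, standing in for one (i, j, k) slice
make_piece <- function(id) data.frame(id = id, value = id + 0:2)

n_pieces <- 200  # illustrative size only

# Growing: each rbind copies everything accumulated so far
grown <- data.frame()
for (id in seq_len(n_pieces)) {
  grown <- rbind(grown, make_piece(id))
}

# Collect the pieces in a list, then bind once at the end
pieces <- lapply(seq_len(n_pieces), make_piece)
bound <- do.call(rbind, pieces)

# Same content either way; only the amount of copying differs
all.equal(grown, bound)
```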

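It is generally better to process things into a list and then rbind the whole list once, as in do.call(rbind, list_of_frames). Granted, you may still run into trouble if the per-piece work is itself hard... fortunately, zoo is fairly efficient at windowed operations, and this one isn't hard.

I'll demonstrate on a significantly reduced problem set (since I don't think it matters whether we look at 16M or 1.5M values).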

my.array <- array(1:1502200, dim=c(290,259,5,4))
eg <- do.call(expand.grid, lapply(dim(my.array)[-1], seq_len))
dim(eg)
# [1] 5180    3
head(eg)
#   Var1 Var2 Var3
# 1    1    1    1
# 2    2    1    1
# 3    3    1    1
# 4    4    1    1
# 5    5    1    1
# 6    6    1    1

system.time({
  list_of_frames <- Map(function(i,j,k) {
    u <- zoo::rollapply(my.array[,i,j,k], 2, mean)
    data.frame(i, j, k, wind = u)
  }, eg[[1]], eg[[2]], eg[[3]])
})
#    user  system elapsed 
#    5.79    0.00    5.80 
head(list_of_frames[[5]])
#   i j k   wind
# 1 5 1 1 1161.5
# 2 5 1 1 1162.5
# 3 5 1 1 1163.5
# 4 5 1 1 1164.5
# 5 5 1 1 1165.5
# 6 5 1 1 1166.5

system.time({
  out <- do.call(rbind, list_of_frames)
})
#    user  system elapsed 
#    0.50    0.03    0.53 
nrow(out)
# [1] 1497020
rbind(head(out), tail(out))
#           i j k      wind
# 1         1 1 1       1.5
# 2         1 1 1       2.5
# 3         1 1 1       3.5
# 4         1 1 1       4.5
# 5         1 1 1       5.5
# 6         1 1 1       6.5
# 1497015 259 5 4 1502194.5
# 1497016 259 5 4 1502195.5
# 1497017 259 5 4 1502196.5
# 1497018 259 5 4 1502197.5
# 1497019 259 5 4 1502198.5
# 1497020 259 5 4 1502199.5

Explanation:

do.call(expand.grid, ...) creates a frame of all the i,j,k combinations you need, built dynamically from the dimensions of your array.
Map(f, is, js, ks) calls f with the first element of each of is, js, and ks, then with the second element of each, and so on (the names are notional for this bullet), so the calls look something like:
```
f(is[1], js[1], ks[1])
f(is[2], js[2], ks[2])
f(is[3], js[3], ks[3])
# ...
```
Then we combine them in one call using do.call(rbind, ...). We really do need do.call here, because this call is equivalent to
```
rbind(list_of_frames[[1]], list_of_frames[[2]], ..., list_of_frames[[5180]])
```
(you are welcome to write out that version by hand if you'd prefer).
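
To make the Map and do.call steps concrete, here is a minimal sketch on toy inputs. The vectors `is`, `js`, `ks` and the function `f` are illustrative only and are not part of the original answer's code.

```r
# Toy inputs: three parallel index vectors (illustrative only)
is <- c(1, 2, 3)
js <- c(10, 20, 30)
ks <- c(100, 200, 300)

# f builds one small frame per (i, j, k) triple
f <- function(i, j, k) data.frame(i = i, j = j, k = k, total = i + j + k)

# Map calls f(is[1], js[1], ks[1]), then f(is[2], js[2], ks[2]), ...
frames <- Map(f, is, js, ks)

# do.call passes the list elements to rbind as separate arguments
combined <- do.call(rbind, frames)

# Same thing with every argument written out by hand
by_hand <- rbind(frames[[1]], frames[[2]], frames[[3]])
all.equal(combined, by_hand)
```

Writing the arguments out by hand is only practical for a short list, which is why do.call is the natural choice when the list has thousands of elements.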

- r2evans

4

Another option is to flatten the multidimensional array first, and then compute the rolling mean with data.table.

library(data.table)
system.time({
    ans <- setDT(as.data.frame.table(my.array))[
        , .(wind=((Freq + shift(Freq)) / 2)[-1L]), 
        .(time=Var4, level=Var3, lat=Var2)]
    cols <- c("time", "level", "lat")
    ans[, (cols) := lapply(.SD, function(x) match(x, unique(x))), .SDcols=cols]
})
ans

Output:

          time level lat       wind
       1:    1     1   1        1.5
       2:    1     1   1        2.5
       3:    1     1   1        3.5
       4:    1     1   1        4.5
       5:    1     1   1        5.5
      ---                          
16467216:    4    55 259 16524195.5
16467217:    4    55 259 16524196.5
16467218:    4    55 259 16524197.5
16467219:    4    55 259 16524198.5
16467220:    4    55 259 16524199.5

Timing:

   user  system elapsed 
   4.90    1.16    5.66

For comparison:

library(zoo)
system.time({
    as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))  
})
#   user  system elapsed 
#  21.89    0.63   22.51
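
Because the window width is 2, the rolling mean is just the average of each element and its neighbour along the first dimension, which is what `(Freq + shift(Freq)) / 2` computes within each group. As a sketch in base R only (not something proposed in this thread, and not timed here), the same arithmetic can be applied to the whole array at once by dropping the first and the last slice. The small array `small` is a stand-in for `my.array`.

```r
# Small stand-in array with the same layout as my.array (sizes are illustrative)
small <- array(as.numeric(1:120), dim = c(5, 4, 3, 2))
n <- dim(small)[1]

# Average each element with its predecessor along the first dimension;
# drop = FALSE keeps all four dimensions even if one has length 1
rolled <- (small[-1, , , , drop = FALSE] + small[-n, , , , drop = FALSE]) / 2
dim(rolled)  # the first dimension shrinks by one

# Cross-check one column against an explicit window-by-window mean
x <- small[, 2, 3, 1]
explicit <- vapply(seq_len(n - 1), function(t) mean(x[t:(t + 1)]), numeric(1))
all.equal(rolled[, 2, 3, 1], explicit)
```

This shortcut only works for a window of width 2; wider windows would need more shifted slices or a dedicated rolling function.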

- chinsoon12

1

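Thanks, I had forgotten how fast data.table is! I'm marking this as the accepted answer because my real data may be more than 10 times larger than the example I gave, and this answer will save me a lot of time. - thiagoveloso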

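I'm struggling to convert the resulting data.table back into an array with dimensions 289 259 55 4 (similar to the original array). Do you have any hints? - thiagoveloso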

I guess by using array()? Maybe post another question? I'm not at a computer right now. - chinsoon12
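
On the follow-up about getting back to an array: one possible approach, sketched here under the assumption that the rows are still in the order they were produced in (first dimension varying fastest, then lat, level, time), is to pass the value column straight to array(). The example uses a small base R stand-in rather than the data.table result; `rolled`, `long`, and the sizes are illustrative.

```r
# Small stand-in for the rolled array (sizes are illustrative)
rolled <- array(as.numeric(1:48), dim = c(4, 3, 2, 2))

# Long format: as.data.frame.table varies the first dimension fastest
long <- as.data.frame.table(rolled, responseName = "wind")

# Rebuild the array; this relies on the row order being unchanged
rebuilt <- array(long$wind, dim = dim(rolled))
all.equal(rebuilt, rolled)
```

With the data.table result, the analogous call would presumably be array(ans$wind, dim = c(289, 259, 55, 4)), provided ans has not been sorted or filtered in between.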

- Ritchie Sacramento · Accepted Answer

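apply() can operate over any set of dimensions, so you can use the following to get the same result much faster, wrapping it in as.data.frame.table() to efficiently convert the array output into a data frame: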

library(zoo)
df <- as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))

This isn't necessary, but you can tidy the result up to match your original output:

# Convert the factor index columns to their integer codes
idx <- vapply(df, is.factor, logical(1))
df[idx] <- lapply(df[idx], as.integer)

df <- setNames(df[c(4,3,2,5)], c("time", "level", "lat", "wind"))

Check that the results are identical:

identical(df2, df)
[1] TRUE
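
To see what the apply() call does with its margins, here is a small sketch that avoids the zoo dependency. The function `roll2` is a hypothetical base R stand-in for rollmean(x, 2), and the array sizes are illustrative.

```r
# Width-2 rolling mean in base R (stand-in for zoo::rollmean(x, 2))
roll2 <- function(x) (x[-1] + x[-length(x)]) / 2

small <- array(as.numeric(1:120), dim = c(5, 4, 3, 2))

# MARGIN = c(2, 3, 4): roll2 receives each vector that runs along dimension 1
res <- apply(small, c(2, 3, 4), roll2)
dim(res)  # length of roll2's result first, then the margin dimensions

# One row per array cell; the index columns come back as factors
long <- as.data.frame.table(res, responseName = "wind")
str(long)
```

The factor index columns are why the tidy-up step converts them to integers before comparing against the loop-built data frame.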