Efficiently converting data to vectors in R

Question

Tags: r, performance, benchmarking, microbenchmark

6

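Could somebody help me make this R code more efficient?

I'm trying to write a function that converts a list of strings to a character vector, a list of numbers to a numeric vector, or more generally a list of elements of some type to a vector of that type.

I want to convert a list to a vector of a particular type if it has the following properties:

They are homogeneously typed. Every element of the list is of type 'character', or 'complex' or so on.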

Each element of the list is length-one.

as_atomic <- local({

    assert_is_valid_elem <- function (elem, mode) {

        if (length(elem) != 1 || !is(elem, mode)) {
            stop("")
        }
        TRUE
    }

    function (coll, mode) {

        if (length(coll) == 0) {
            vector(mode)
        } else {
            # check that the generic vector is composed only
            # of length-one values, and each value has the correct type.

            # uses more memory than 'for', but is presumably faster.
            vapply(coll, assert_is_valid_elem, logical(1), mode = mode)

            as.vector(coll, mode = mode)
        }
    }
})

For example:

as_atomic(list(1, 2, 3), 'numeric')
as.numeric(c(1,2,3))

# this fails (mixed types)
as_atomic( list(1, 'a', 2), 'character' )
# ERROR.

# this fails (non-length one element)
as_atomic( list(1, c(2,3,4), 5), 'numeric' )
# ERROR.

# this fails (cannot convert numbers to strings)
as_atomic( list(1, 2, 3), 'character' )
# ERROR.

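The code above works, but it is very slow, and I can't see any way to optimise it without changing the function's behaviour. The behaviour of as_atomic is important; I can't switch to a base function I know of (for example unlist), because I need to throw an error for bad lists.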
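
To make that constraint concrete, here is a minimal sketch of what unlist() does with the two kinds of bad list from the examples above. It uses only base R, and the variable names are purely illustrative.

```r
# Mixed types: unlist() silently coerces everything to the most general
# type (here character) instead of raising an error.
mixed <- unlist(list(1, "a", 2))
print(mixed)

# Non-length-one element: unlist() flattens it, so the result is longer
# than the input list and no error is raised either.
flattened <- unlist(list(1, c(2, 3, 4), 5))
print(length(flattened))
```

In both cases a bad list goes through without complaint, which is why a separate validation step is needed before the coercion.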

require(microbenchmark)

microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    vapply(1:1000, identity, integer(1)),
    unit = 'ns'
)

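On my (fairly fast) machine the benchmark runs at about 40 Hz, so this function is almost always the rate-limiting step in my code. The vapply control benchmark runs at about 1650 Hz, which is still very slow.
Is there any way to significantly improve the efficiency of this operation? Any suggestions are appreciated.
If any clarification or edits are needed, please leave a comment below.

Edit:

Hi all,
Sorry for the very late reply; I had exams to sit before I could get back to reimplementing this.
Thank you all for the performance tips. Using plain R code, I got the performance up from a terrible 40 Hz to a more acceptable 600 Hz.
The largest speedup came from using typeof or mode instead of is; this really sped up the tight inner checking loop.
I'll probably have to bite the bullet and rewrite this in Rcpp to get really good performance.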

- Róisín Grannell
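
As an illustration of the tips mentioned in the edit, one possible pure-R version is sketched below. This is not the asker's final code: the name as_atomic_fast is hypothetical, and the sketch assumes R 3.2.0 or later for lengths(). The idea is to replace the per-element is() call with one vectorised length check and one vapply over base::mode.

```r
# Hypothetical sketch: validate the whole list with vectorised checks,
# then coerce it in a single step.
as_atomic_fast <- function(coll, mode) {
  # An empty list has nothing to check.
  if (length(coll) == 0L) return(vector(mode))

  # lengths() returns the length of every element in one call.
  if (any(lengths(coll) != 1L)) {
    stop("every element must have length one")
  }

  # Refer to base::mode explicitly: inside this function `mode` is the
  # character argument, not the function.
  modes <- vapply(coll, base::mode, character(1), USE.NAMES = FALSE)
  if (any(modes != mode)) {
    stop("every element must have mode '", mode, "'")
  }

  as.vector(coll, mode = mode)
}
```

Under these assumptions, as_atomic_fast(list(1, 2, 3), "numeric") returns a numeric vector, while the three failing examples from the question each raise an error. No timings are claimed for this sketch; it would need to be benchmarked against the original.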

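Why not use as.numeric(list(1,2,3)) or as.character...? - agstudy

Those functions will try to convert mixed-type collections. They coerce elements of other types to NA values rather than throwing an error when the list has mixed types. as.numeric(list(1, 2, 'a')) gives c(1, 2, NA). - Róisín Grannell

What is the expected result for list(1, 'a', 2)? - agstudy

Sorry, I'll edit now. - Róisín Grannell

Unfortunately, unlist does not check that every element of its input is length one, nor that they have a particular mode. - Róisín Grannell

3 Answers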

4

Try:

as_atomic_2 <- function(x, mode) {
  if (length(unique(vapply(x, typeof, ""))) != 1L) stop("mixed types")
  as.vector(x, mode)
}
as_atomic_2(list(1, 2, 3), 'numeric')
# [1] 1 2 3
as_atomic_2(list(1, 'a', 2), 'character')
# Error in as_atomic_2(list(1, "a", 2), "character") : mixed types
as_atomic_2(list(1, c(2,3,4), 5), 'numeric' )
# Error in as.vector(x, mode) : 
#   (list) object cannot be coerced to type 'double'

microbenchmark(
  as_atomic( as.list(1:1000), 'numeric'),
  as_atomic_2(as.list(1:1000), 'numeric'),
  vapply(1:1000, identity, integer(1)),
  unit = 'ns'
)    
# Unit: nanoseconds
#                                     expr      min       lq     median 
#    as_atomic(as.list(1:1000), "numeric") 23571781 24059432 24747115.5 
#  as_atomic_2(as.list(1:1000), "numeric")  1008945  1038749  1062153.5 
#     vapply(1:1000, identity, integer(1))   719317   762286   778376.5

- BrodieG
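
One thing to watch with a typeof-based check: typeof distinguishes integers from doubles, whereas mode reports both as "numeric". The short sketch below, with illustrative variable names, shows the difference on a list that mixes the two storage types.

```r
x <- list(1L, 2)  # one integer element, one double element

# typeof() reports the storage type, so the two elements differ ...
types <- vapply(x, typeof, character(1))
print(types)

# ... while mode() groups integer and double together as "numeric".
modes <- vapply(x, mode, character(1))
print(modes)
```

So a homogeneity check built on typeof rejects this list as mixed, while a check built on mode treats it as uniformly numeric. Which behaviour is wanted depends on how strictly "homogeneously typed" is meant.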

3

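Defining your own function to do the type checking seems to be the bottleneck. Using one of the built-in functions gives a large speedup. However, the call signature changes somewhat (although it could probably be adapted). The example below is the fastest version I could come up with:

As mentioned, using is.numeric, is.character and so on gives the largest speedup.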

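as_atomic2 <- function(l, check_type) {
  # check_type is a predicate such as is.numeric or is.character
  if (!all(vapply(l, check_type, logical(1)))) stop("element of the wrong type")
  r <- unlist(l)
  # unlist() flattens longer elements, so a length mismatch reveals them
  if (length(r) != length(l)) stop("element not of length one")
  r
}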

Here is the fastest solution I could come up with using the original interface:

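as_atomic3 <- function(l, type) {
  # mode() returns a single string per element, so FUN.VALUE is character(1)
  if (!all(vapply(l, mode, character(1)) == type)) stop("element of the wrong type")
  r <- unlist(l)
  if (length(r) != length(l)) stop("element not of length one")
  r
}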

Benchmark against the original:

res <- microbenchmark(
    as_atomic( as.list(1:1000), 'numeric'),
    as_atomic2( as.list(1:1000), is.numeric),
    as_atomic3( as.list(1:1000), 'numeric'),
    unit = 'ns'
)
#                                    expr      min         lq     median         uq      max neval
#   as_atomic(as.list(1:1000), "numeric") 13566275 14399729.0 14793812.0 15093380.5 34037349   100
# as_atomic2(as.list(1:1000), is.numeric)   314328   325977.0   346353.5   369852.5   896991   100
#  as_atomic3(as.list(1:1000), "numeric")   856423   899942.5   967705.5  1023238.0  1598593   100

- Jan van der Laan
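
The remark that the call signature "could probably be adapted" can be sketched as follows. The wrapper name as_atomic2_by_name is hypothetical, and it assumes that a base predicate called is.<mode> exists for every mode passed in (true for "numeric", "character", "complex", "logical" and similar). It has not been benchmarked here.

```r
# Hypothetical sketch: keep the original string interface, but look up
# the fast built-in predicate by name.
as_atomic2_by_name <- function(l, mode) {
  # An empty list has nothing to check; return an empty vector of that mode.
  if (length(l) == 0L) return(vector(mode))

  # Map e.g. "numeric" to is.numeric. match.fun() raises an error if no
  # function called is.<mode> exists.
  check_type <- match.fun(paste0("is.", mode))
  if (!all(vapply(l, check_type, logical(1)))) {
    stop("every element must satisfy is.", mode, "()")
  }

  r <- unlist(l, use.names = FALSE)
  # unlist() flattens longer elements, so a length mismatch reveals them.
  if (length(r) != length(l)) stop("every element must have length one")
  r
}
```

With this wrapper the call looks like the original, for example as_atomic2_by_name(list(1, 2, 3), "numeric"), while the check itself still runs through a built-in predicate.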

- hadley · Accepted Answer

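There are two parts to this problem:

checking that the inputs are valid
coercing a list into a vector

Checking valid inputs

First, I'd avoid is() because it's known to be slow. That gives: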

check_valid <- function (elem, mode) {
  if (length(elem) != 1) stop("Must be length 1")
  if (mode(elem) != mode) stop("Not desired type")

  TRUE
}

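Now we need to figure out whether a for loop or one of the apply variants is faster. We'll benchmark the worst case, where all inputs are valid and every element therefore has to be checked.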

worst <- as.list(0:101)

library(microbenchmark)
options(digits = 3)
microbenchmark(
  `for` = for(i in seq_along(worst)) check_valid(worst[[i]], "numeric"),
  lapply = lapply(worst, check_valid, "numeric"),
  vapply = vapply(worst, check_valid, "numeric", FUN.VALUE = logical(1))
)

## Unit: microseconds
##    expr min  lq median  uq  max neval
##     for 278 293    301 318 1184   100
##  lapply 274 282    291 310 1041   100
##  vapply 273 284    288 298 1062   100

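The three methods are basically tied. The apply variants come out very slightly ahead of the for loop in this run, probably because of the special C tricks they use.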

Coercing a list into a vector

Now let's look at a few ways of coercing a list into a vector:

change_mode <- function(x, mode) {
  mode(x) <- mode
  x
}

microbenchmark(
  change_mode = change_mode(worst, "numeric"),
  unlist = unlist(worst),
  as.vector = as.vector(worst, "numeric")
)

## Unit: microseconds
##         expr   min    lq median   uq    max neval
##  change_mode 19.13 20.83  22.36 23.9 167.51   100
##       unlist  2.42  2.75   3.11  3.3  22.58   100
##    as.vector  1.79  2.13   2.37  2.6   8.05   100

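It looks like you're already using the fastest method, and the total cost is dominated by the check.

Alternative approach

Another idea is that we might get a little faster by looping over the vector once, instead of once to check and once to coerce.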

as_atomic_for <- function (x, mode) {
  out <- vector(mode, length(x))

  for (i in seq_along(x)) {
    check_valid(x[[i]], mode)
    out[i] <- x[[i]]
  }

  out
}
microbenchmark(
  as_atomic_for(worst, "numeric")
)

## Unit: microseconds
##                             expr min  lq median  uq  max neval
##  as_atomic_for(worst, "numeric") 497 524    557 685 1279   100

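That's definitely worse.

Overall, I think this suggests that if you want to make this function faster, you should try vectorising the check function in Rcpp.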