Fastest way to select the i-th highest value from each row and assign it to a new column

Question

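I'm looking for a way to add a new column to an existing data frame / data table that holds the i-th highest value of each individual row. For example, if I want the 4th highest value, the new column would contain 1.9 for the first row.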

data <- data.frame(a = c("a","a","b","b","c","a"),
                   peak1 = c(1.1,2.5,2.4,2.1,2.5,2.6),
                   peak2 = c(1.2,2.5,2.4,2.1,2.5,2.6),
                   peak3 = c(1.3,2.5,2.4,2.1,2.5,2.6),
                   peak4 = c(1.4,2.5,2.5,2.1,2.5,2.6),
                   peak5 = c(1.5,2.5,2.46,2.1,2.5,2.6),
                   peak6 = c(1.6,2.5,2.4,2.1,2.5,2.6),
                   peak7 = c(1.7,2.5,2.4,2.1,2.5,2.0),
                   peak8 = c(1.8,2.5,2.4,2.1,2.5,2.1),
                   peak9 = c(1.9,2.2,2.4,2.1,2.5,2.2),
                   peak10 = c(2,2.5,2.4,2.1,2.5,2.3),
                   peak11 = c(2.1,2.5,2.4,2.1,2.5,2.4),
                   peak12 = c(2.2,2.5,2.4,2.99,3,2.5))

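I tried adding an index column and then selecting the value with lapply, but this returns a list in each cell and runs very slowly on the real data set (about 3 million records). Ideally I'm looking for a solution that finishes in a few seconds, since this runs inside a Shiny app.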

data$index <- lapply(split(data[,c(-1)],seq(nrow(data))),FUN = order, decreasing = TRUE)
rank <- 4
data$result <- lapply(1:nrow(data), function(row) data[row, data$index[[row]][rank]+1])

- Robert K

1 Answer

- Martin Morgan · Accepted Answer

I have updated my answer to provide three solutions; in retrospect, fun2() is the best answer (fastest, most robust, and easy to understand).

There are many StackOverflow posts on finding the n-th highest value, e.g., https://dev59.com/RHE95IYBdhLWcg3wGqIH#2453619. Here is a function that implements that solution:

nth <- function(x, nth_largest) {
    n <- length(x) - (nth_largest - 1L)
    sort(x, partial=n)[n]
}
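
As a quick sanity check, the sketch below applies nth() to the peak values of the first row of the question's data. The vector name `row1` is only for illustration, and the comparison against a full sort() is added here purely to cross-check the partial sort.

```r
nth <- function(x, nth_largest) {
    # position of the nth largest value when x is in ascending order
    n <- length(x) - (nth_largest - 1L)
    # a partial sort only guarantees that element n is in its final place
    sort(x, partial = n)[n]
}

# peak1..peak12 of the first row of the question's data
row1 <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.1, 2.2)

nth(row1, 4)                        # 1.9
sort(row1, decreasing = TRUE)[4]    # same value via a full sort
```

The partial sort avoids ordering the whole vector, which is why it is the usual choice when only one order statistic is needed.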

Apply this to each row of the numeric part of the data frame (every column except the first):

data$nth <- apply(data[,-1], 1, nth, nth_largest = 4)

I created a large data set by doubling the data 20 times (6 × 2^20 = 6,291,456 rows):

for (i in 1:20) data = rbind(data, data)

and did some basic timings:

> system.time(apply(head(data[,-1], 1000), 1, nth, 4))
   user  system elapsed
  0.012   0.000   0.012
> system.time(apply(head(data[,-1], 10000), 1, nth, 4))
   user  system elapsed
  0.150   0.005   0.155
> system.time(apply(head(data[,-1], 100000), 1, nth, 4))
   user  system elapsed
  1.274   0.005   1.279
> system.time(apply(head(data[,-1], 1000000), 1, nth, 4))
   user  system elapsed
 14.847   0.095  14.943

So it scales linearly with the number of rows (not surprising...), at about 15 seconds per million rows.

For comparison, I wrapped this approach in a function:

fun0 <-
    function(df, nth_largest)
{
    n <- ncol(df) - (nth_largest - 1L)
    nth <- function(x)
        sort(x, partial=n)[n]
    apply(df, 1, nth)
}

used as fun0(data[,-1], 4).

A different strategy is to create a matrix from the numeric data,

m <- as.matrix(data[,-1])

then order the entire matrix and put the row index of each value into that sorted order:

o <- order(m)
i <- row(m)[o]

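To discard the largest, second largest, ... values, set the last occurrence of each row index to NA, repeating this nth_largest - 1 times; the n-th largest value is then the last remaining (non-NA) occurrence of each row index.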

for (iter in seq_len(nth_largest - 1L))
    i[!duplicated(i, fromLast = TRUE)] <- NA_integer_
idx <- !is.na(i) & !duplicated(i, fromLast = TRUE)

The corresponding values are m[o[idx]], which are put back in row order with

m[o[idx]][order(i[idx])]
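
The steps above can be traced on a small example. This is a minimal sketch on a hypothetical 3 x 4 matrix (the values and `nth_largest <- 2L` are made up for illustration, not taken from the question's data), showing how the row indices in sorted order lead to the second largest value of each row.

```r
# hypothetical 3 x 4 example; each row is one record
m <- matrix(c(5, 3,  7, 1,
              1, 8,  4, 6,
              9, 2, 10, 3), nrow = 3, byrow = TRUE)
nth_largest <- 2L

o <- order(m)     # positions of all values, smallest to largest
i <- row(m)[o]    # row index of each value, in that sorted order

# blank out the largest remaining value of every row, nth_largest - 1 times
for (iter in seq_len(nth_largest - 1L))
    i[!duplicated(i, fromLast = TRUE)] <- NA_integer_

# the last surviving occurrence of each row index marks its nth largest value
idx <- !is.na(i) & !duplicated(i, fromLast = TRUE)

result <- m[o[idx]][order(i[idx])]
result            # 5 6 9
```

Each pass of the loop scans the full sorted index vector, which is consistent with this approach slowing down as nth_largest grows.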

So an alternative solution is:

fun1 <-
    function(df, nth_largest)
{
    m <- as.matrix(df)
    o <- order(m)
    i <- row(m)[o]

    for (iter in seq_len(nth_largest - 1L))
        i[!duplicated(i, fromLast = TRUE)] <- NA_integer_
    idx <- !is.na(i) & !duplicated(i, fromLast = TRUE)

    m[o[idx]][order(i[idx])]
}

We have

> system.time(res0 <- fun0(head(data[,-1], 1000000), 4))
   user  system elapsed 
 17.604   0.075  17.680 
> system.time(res1 <- fun1(head(data[,-1], 1000000), 4))
   user  system elapsed 
  3.036   0.393   3.429 
> identical(unname(res0), res1)
[1] TRUE

In general, fun1() appears to be faster as long as nth_largest is not too large.

For fun2(), order the original data by row and then by value, and keep only the relevant indices.

fun2 <-
    function(df, nth_largest)
{
    m <- as.matrix(df)
    o <- order(row(m), m)
    idx <- seq(ncol(m) - (nth_largest - 1), by = ncol(m), length.out = nrow(m))
    m[o[idx]]
}

With

> system.time(res1 <- fun1(head(data[, -1], 1000000), 4))
   user  system elapsed 
  2.948   0.406   3.355 
> system.time(res2 <- fun2(head(data[, -1], 1000000), 4))
   user  system elapsed 
  0.316   0.062   0.379 
> identical(res1, res2)
[1] TRUE
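
To tie this back to the question, the sketch below runs fun2() on the original six-row data and stores the result in a new column. The column name `result` is taken from the question; the position arithmetic in the comments is my reading of why `idx` works, and the apply() line (with the hypothetical name `check`) is only an independent cross-check.

```r
data <- data.frame(a = c("a","a","b","b","c","a"),
                   peak1 = c(1.1,2.5,2.4,2.1,2.5,2.6),
                   peak2 = c(1.2,2.5,2.4,2.1,2.5,2.6),
                   peak3 = c(1.3,2.5,2.4,2.1,2.5,2.6),
                   peak4 = c(1.4,2.5,2.5,2.1,2.5,2.6),
                   peak5 = c(1.5,2.5,2.46,2.1,2.5,2.6),
                   peak6 = c(1.6,2.5,2.4,2.1,2.5,2.6),
                   peak7 = c(1.7,2.5,2.4,2.1,2.5,2.0),
                   peak8 = c(1.8,2.5,2.4,2.1,2.5,2.1),
                   peak9 = c(1.9,2.2,2.4,2.1,2.5,2.2),
                   peak10 = c(2,2.5,2.4,2.1,2.5,2.3),
                   peak11 = c(2.1,2.5,2.4,2.1,2.5,2.4),
                   peak12 = c(2.2,2.5,2.4,2.99,3,2.5))

fun2 <- function(df, nth_largest) {
    m <- as.matrix(df)
    # sort positions by row first, then by value within each row
    o <- order(row(m), m)
    # row r now occupies sorted positions (r - 1) * ncol(m) + 1 .. r * ncol(m),
    # in ascending order, so its nth largest sits nth_largest - 1 before the end
    idx <- seq(ncol(m) - (nth_largest - 1), by = ncol(m), length.out = nrow(m))
    m[o[idx]]
}

# drop the non-numeric first column before calling fun2()
data$result <- fun2(data[, -1], 4)
data$result    # 1.9 2.5 2.4 2.1 2.5 2.6

# independent cross-check: full sort of every row
peaks <- as.matrix(data[, paste0("peak", 1:12)])
check <- unname(apply(peaks, 1, function(x) sort(x, decreasing = TRUE)[4]))
identical(data$result, check)
```

Because fun2() returns a plain numeric vector in row order, it can be assigned directly as a column, avoiding the list-column problem described in the question.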

Profiling fun2() on the full data set:

> dim(data)
[1] 6291456      13
> Rprof(); res2 <- fun2(data[, -1], 4); Rprof(NULL); summaryRprof()
$by.self
              self.time self.pct total.time total.pct
"order"            1.50    63.56       1.84     77.97
"unlist"           0.36    15.25       0.36     15.25
"row"              0.34    14.41       0.34     14.41
"fun2"             0.10     4.24       2.36    100.00
"seq.default"      0.06     2.54       0.06      2.54
...

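The profile shows that most of the time is spent in order(); I'm not entirely sure how order() on multiple keys is implemented, but it probably has the complexity of a radix sort. Either way, it is very fast!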