加速函数：在计算平均值之前检查NA数量

Question

加速函数：在计算平均值之前检查NA数量

ralgorithmvectorizationapplyna

3

下面的函数计算向量的平均值。然而，它首先检查向量中存在的NA的比例，如果超过给定的阈值，则返回NA而不是平均值。

我的问题是，当前的实现相当低效。它比简单运行`mean(vec, na.rm=TRUE)`慢7倍以上。

我尝试了使用`na.omit`的替代方法，但那甚至更慢。

鉴于我的数据规模，执行单个`lapply`需要超过40分钟。

有什么建议可以更快地完成同样的任务吗？

更新 - 关于@thelatemail的解决方案和@Arun的评论：

我正在对数百个组执行此函数，每个组的大小都不同。这个问题最初提供的示例数据只是为了方便创建人工数据而提供的整洁数据框。

替代样本数据以避免混淆。

# Sample Data 
# ------------
  set.seed(1)
  # slightly different sizes for each group
  N1 <- 5e3
  N2 <- N1 + as.integer(rnorm(1, 0, 100))

  # One group has only a moderate amount of NA's
  SAMP1 <- rnorm(N1)
  SAMP1[sample(N1, .25 * N1, FALSE)] <- NA  # add in NA's

  # Another group has many NA's
  SAMP2 <- rnorm(N2)
  SAMP2[sample(N2, .95 * N2, FALSE)] <- NA  # add in large number of NA's

  # put them all in a list
  SAMP.NEW <- list(SAMP1, SAMP2)

  # keep it clean
  rm(SAMP1, SAMP2)

# Execute 
# -------    
  lapply(SAMP.NEW, meanIfThresh)

原始样本数据、函数等

# Sample Data 
# ------------
  set.seed(1)
  rows <- 20000  # actual data has more than 7M rows
  cols <-  1000  

  SAMP <- replicate(cols, rnorm(rows))
  SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA  # add in NA's

  # Select 5 random rows, and have them be 90% NA
  tooSparse <- sample(rows, 5)
  for (r in tooSparse)
    SAMP[r, sample(cols, cols * .9, FALSE)] <- NA

# Function 
# ------------
    meanIfThresh <- function(vec, thresh=12/15) { 
      # Calculates the mean of vec, however, 
      #   if the number of non-NA values of vec is less than thresh, returns NA 

      # thresh : represents how much data must be PRSENT. 
      #          ie, if thresh is 80%, then there must be at least 


      len <- length(vec)

      if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
      # if the proportion of NA's is greater than the threshold, return NA
      # example:  if I'm looking at 14 days, and I have 12 NA's,
      #            my proportion is 85.7 % = (12 / 14)
      #            default thesh is  80.0 % = (12 / 15)
      #            Thus, 12 NAs in a group of 14 would be rejected


    # else, calculate the mean, removing NA's
    return(mean(vec, na.rm=TRUE))       
  }


  # Execute
  # -----------------
  apply(SAMP, 1, meanIfThresh)

  # Compare with `mean`
  #----------------
  plain    <- apply(SAMP, 1, mean, na.rm=TRUE)
  modified <- apply(SAMP, 1, meanIfThresh)

  # obviously different
  identical(plain, modified)
  plain[tooSparse]
  modified[tooSparse]


  microbenchmark( "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
                , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
                , times = 15L)

 #  With the actual data, the penalty is sevenfold
 #  Unit: seconds
 #           expr      min       lq   median       uq      max neval
 #   meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
 # mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15

- Ricardo Saporta

2

这行代码 SAMP[sample(SAMP, .25 * length(SAMP), FALSE)] <- NA 在 R3.0.1 中对我来说会抛出一个错误，提示 only 0's may be mixed with negative subscripts。 - thelatemail

也许可以改为 SAMP[sample(seq_along(SAMP), .25 * length(SAMP))] <- NA？ - thelatemail

谢谢，应该是 SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA（已在编辑中更正） - Ricardo Saporta

2个回答

1

你的函数中是否需要两次遍历向量vec？如果你可以先存储NA，也许可以加快计算速度：

meanIfThresh2 <- function(vec, thresh=12/15) { 

  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)

  return(sum(vec[!nas])/(len-Nna))
}

编辑：我进行了类似的基准测试，以查看此更改的影响：

> microbenchmark(  "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
+                 , "meanIfThresh2"   = apply(SAMP, 1, meanIfThresh2)
+                 , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+                 , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15

- Daniel Fischer

你应该做同样的基准测试。我认为这不会改变任何事情。在我看来，@thelatemail是正确的。 - Arun

好的，我在我的帖子中添加了基准测试 - 看起来存储NA会带来一些速度。至少它似乎和常规的平均函数一样快。 - Daniel Fischer

我本应该说，这不会改变太多。请检查latemail的答案中的编辑。 - Arun

是的，您使用 rowMeans 的建议带来了巨大的改变！但在此之前，我认为解决方案必须更慢，因为它首先执行常规平均值，然后再进行 is.na 步骤。我不知道 rowMeans 是如此快... - Daniel Fischer

@Daniel，这确实有帮助！它减少了调用“mean”的额外开销。 - Ricardo Saporta

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- thelatemail · Accepted Answer

你能否在之后替换高NA行的平均值，就像这样：

# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA

一些基准测试：

Ricardo <- function(vec, thresh=12/15) {
    len <- length(vec)
    if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
    return(mean(vec, na.rm=TRUE))       
}

DanielFischer <- function(vec, thresh=12/15) {

    len <- length(vec)
    nas <- is.na(vec)
    Nna <- sum(nas)
    if( (Nna / len) > thresh)
        return(NA_real_)
    return(sum(vec[!nas])/(len-Nna))
}

thelatemail <- function(mat) {
    result <- rowMeans(mat, na.rm=TRUE)
    NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
    result[NArows] <- NA
    result
}

require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo), 
               m2 <- apply(SAMP, 1, DanielFischer), 
               m3 <- thelatemail(SAMP), times = 5L)

Unit: milliseconds
                                expr       min        lq    median        uq       max neval
       m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283     5
 m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015     5
                m3 <- latemail(SAMP)  337.1862  340.6339  371.6148  376.5517  383.4436     5

all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE