Speeding up a function: checking the number of NAs before computing the mean

3
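The function below computes the mean of a vector. However, it first checks the proportion of NAs in the vector and, if that proportion exceeds a given threshold, returns NA instead of the mean.
My problem is that the current implementation is quite inefficient. It is more than 7 times slower than simply running `mean(vec, na.rm=TRUE)`.
I tried an alternative using `na.omit`, but that was even slower.
Given the size of my data, a single `lapply` takes more than 40 minutes.
Any suggestions for accomplishing the same task faster?
Update - regarding @thelatemail's solution and @Arun's comment:
I am running this function over several hundred groups, each of a different size. The sample data originally provided in this question was a tidy data frame only because that made it easy to generate artificial data.
Alternative sample data to avoid confusion: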
# Sample Data 
# ------------
  set.seed(1)
  # slightly different sizes for each group
  N1 <- 5e3
  N2 <- N1 + as.integer(rnorm(1, 0, 100))

  # One group has only a moderate amount of NA's
  SAMP1 <- rnorm(N1)
  SAMP1[sample(N1, .25 * N1, FALSE)] <- NA  # add in NA's

  # Another group has many NA's
  SAMP2 <- rnorm(N2)
  SAMP2[sample(N2, .95 * N2, FALSE)] <- NA  # add in large number of NA's

  # put them all in a list
  SAMP.NEW <- list(SAMP1, SAMP2)

  # keep it clean
  rm(SAMP1, SAMP2)

# Execute 
# -------    
  lapply(SAMP.NEW, meanIfThresh)

Original sample data, function, etc.

# Sample Data 
# ------------
  set.seed(1)
  rows <- 20000  # actual data has more than 7M rows
  cols <-  1000  

  SAMP <- replicate(cols, rnorm(rows))
  SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA  # add in NA's

  # Select 5 random rows, and have them be 90% NA
  tooSparse <- sample(rows, 5)
  for (r in tooSparse)
    SAMP[r, sample(cols, cols * .9, FALSE)] <- NA

# Function 
# ------------
    meanIfThresh <- function(vec, thresh=12/15) { 
      # Calculates the mean of vec; however, if the proportion of NA values
      #   in vec is greater than thresh, returns NA instead.

      # thresh : the maximum proportion of NA's tolerated in vec.
      #          ie, if thresh is 80%, then vec is rejected when more than
      #          80% of its values are NA

      len <- length(vec)

      # if the proportion of NA's is greater than the threshold, return NA
      # example:  if I'm looking at 14 days, and I have 12 NA's,
      #            my proportion is 85.7 % = (12 / 14)
      #            default thresh is 80.0 % = (12 / 15)
      #            Thus, 12 NAs in a group of 14 would be rejected
      if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)

      # else, calculate the mean, removing NA's
      return(mean(vec, na.rm=TRUE))       
    }


  # Execute
  # -----------------
  apply(SAMP, 1, meanIfThresh)

  # Compare with `mean`
  #----------------
  plain    <- apply(SAMP, 1, mean, na.rm=TRUE)
  modified <- apply(SAMP, 1, meanIfThresh)

  # obviously different
  identical(plain, modified)
  plain[tooSparse]
  modified[tooSparse]


  library(microbenchmark)
  microbenchmark( "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
                , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
                , times = 15L)

 #  With the actual data, the penalty is sevenfold
 #  Unit: seconds
 #           expr      min       lq   median       uq      max neval
 #   meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
 # mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15

2
The line SAMP[sample(SAMP, .25 * length(SAMP), FALSE)] <- NA throws an error for me in R 3.0.1: only 0's may be mixed with negative subscripts - thelatemail
Perhaps change it to SAMP[sample(seq_along(SAMP), .25 * length(SAMP))] <- NA - thelatemail
Thanks, it should be SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA (corrected in the edit) - Ricardo Saporta
2 Answers

5
Could you replace the means of the high-NA rows afterwards, like this:
# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA
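
As a small worked example of the idea above, here is a sketch that wraps the two steps in a function and exposes the threshold as a parameter. The name `rowMeanIfThresh` and the toy matrix are illustrative assumptions, not part of the original answer. `rowMeans(is.na(mat))` is just the per-row proportion of NAs, so it should be equivalent to `rowSums(is.na(mat))/ncol(mat)`.

```r
# Hypothetical wrapper (name is an assumption): row means, blanked out
# wherever the proportion of NA's in the row exceeds thresh
rowMeanIfThresh <- function(mat, thresh = 12/15) {
  result <- rowMeans(mat, na.rm = TRUE)          # vectorized mean per row
  tooSparse <- rowMeans(is.na(mat)) > thresh     # per-row NA proportion
  result[tooSparse] <- NA_real_
  result
}

# Toy data: 3 rows of 10 values each
m <- rbind(
  c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10),   # 20% NA -> mean is kept
  c(rep(NA, 9), 7),                     # 90% NA -> rejected
  1:10                                  # no NA's
)

res <- rowMeanIfThresh(m)
print(res)   # 5.625 NA 5.500
```

The whole matrix is handled in two vectorized passes instead of one R-level function call per row, which is presumably where the speed-up comes from.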

Some benchmarks:

Ricardo <- function(vec, thresh=12/15) {
    len <- length(vec)
    if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
    return(mean(vec, na.rm=TRUE))       
}

DanielFischer <- function(vec, thresh=12/15) {

    len <- length(vec)
    nas <- is.na(vec)
    Nna <- sum(nas)
    if( (Nna / len) > thresh)
        return(NA_real_)
    return(sum(vec[!nas])/(len-Nna))
}

thelatemail <- function(mat) {
    result <- rowMeans(mat, na.rm=TRUE)
    NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
    result[NArows] <- NA
    result
}

require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo), 
               m2 <- apply(SAMP, 1, DanielFischer), 
               m3 <- thelatemail(SAMP), times = 5L)

Unit: milliseconds
                                expr       min        lq    median        uq       max neval
       m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283     5
 m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015     5
             m3 <- thelatemail(SAMP)  337.1862  340.6339  371.6148  376.5517  383.4436     5

all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE

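Exactly. @RicardoSaporta, you are calling sum(is.na(.)) nrow times. You could use the vectorized rowSums instead. The overhead of the function calls will be greater than the computation itself. Perhaps profiling with Rprof(); modified <- apply(SAMP, 1, meanIfThresh); Rprof(NULL); summaryRprof() might help. - Arun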
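
The profiling call suggested in the comment above can be spelled out roughly as follows. This is only a sketch: `SAMP.small` is an assumed, smaller stand-in for the question's `SAMP`, the profile is written to a temporary file rather than the default `Rprof.out`, and the sampling interval is shortened so that a short run still collects samples.

```r
# Question's function, repeated so the block runs on its own
meanIfThresh <- function(vec, thresh = 12/15) {
  if (sum(is.na(vec)) / length(vec) > thresh) return(NA_real_)
  mean(vec, na.rm = TRUE)
}

set.seed(1)
# Assumed smaller stand-in for SAMP, with 25% NA's
SAMP.small <- matrix(rnorm(20000 * 200), nrow = 20000)
SAMP.small[sample(length(SAMP.small), 0.25 * length(SAMP.small))] <- NA

prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.005)   # start sampling the call stack
modified <- apply(SAMP.small, 1, meanIfThresh)
Rprof(NULL)                          # stop profiling

prof <- summaryRprof(prof.file)
head(prof$by.self)                   # time spent in each function itself
```

The `by.self` table should show how the time is split between `apply`, `mean` and its helpers, and `is.na`, which is the overhead the comment refers to.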
3
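@latemail, you should change apply(SAMP, 1, mean, na.rm = TRUE) to rowMeans(SAMP, na.rm=TRUE) - Arun
@thelatemail Thanks for the reply. I should have been more explicit in my question: I need to run this function several hundred times, because I have several hundred groups of different sizes. (The confusion probably stems from my oversimplifying the sample data.) However... this sparked the idea of melting the data by group and applying your method that way. - Ricardo Saporta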
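
One way the "melt by group" idea from the comment above might look in base R is sketched below. It is not code from the thread: `groupMeanIfThresh` is a hypothetical helper, and it assumes the groups arrive as a list of numeric vectors (like `SAMP.NEW`) with no empty group. All groups are flattened into one vector, and `rowsum()` then computes the per-group NA counts and sums in a single vectorized pass.

```r
# Question's function, repeated here only to check the result
meanIfThresh <- function(vec, thresh = 12/15) {
  if (sum(is.na(vec)) / length(vec) > thresh) return(NA_real_)
  mean(vec, na.rm = TRUE)
}

# Hypothetical helper (not from the thread)
groupMeanIfThresh <- function(lst, thresh = 12/15) {
  n <- lengths(lst)
  stopifnot(all(n > 0))                    # assumption: no empty groups
  vals <- unlist(lst, use.names = FALSE)   # "melt" all groups into one vector
  grp  <- rep(seq_along(lst), n)           # group index of every value
  nas  <- is.na(vals)
  vals[nas] <- 0                           # NA's contribute nothing to the sums
  nNA  <- as.vector(rowsum(as.integer(nas), grp))  # NA count per group
  sums <- as.vector(rowsum(vals, grp))             # sum of non-NA values per group
  out  <- sums / (n - nNA)
  out[nNA / n > thresh] <- NA_real_
  out
}

set.seed(1)
g1 <- rnorm(5000); g1[sample(5000, 1250)] <- NA   # 25% NA
g2 <- rnorm(4937); g2[sample(4937, 4690)] <- NA   # about 95% NA
groups <- list(g1, g2)

res <- groupMeanIfThresh(groups)
ref <- sapply(groups, meanIfThresh)
all.equal(res, ref)
```

Whether this beats `lapply` on the real data would need benchmarking; the sketch only shows that groups of different sizes do not prevent a vectorized approach.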

1

Does your function need to pass over the vector vec twice? If you store the NAs first, that might speed up the computation:

meanIfThresh2 <- function(vec, thresh=12/15) { 

  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)

  return(sum(vec[!nas])/(len-Nna))
}
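
A quick sanity check that the rewritten function gives the same answers as the original might look like this. The two toy vectors are made up for illustration. `all.equal` is used rather than `identical` because, as far as I know, `mean()` does some extra work for numerical accuracy, so `sum(x)/n` can differ from it in the last few bits.

```r
# Original function from the question
meanIfThresh <- function(vec, thresh = 12/15) {
  if (sum(is.na(vec)) / length(vec) > thresh) return(NA_real_)
  mean(vec, na.rm = TRUE)
}

# Variant that computes is.na(vec) once and reuses it
meanIfThresh2 <- function(vec, thresh = 12/15) {
  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if ((Nna / len) > thresh) return(NA_real_)
  sum(vec[!nas]) / (len - Nna)
}

x      <- c(2.5, NA, 4, NA, 7.5, 1)   # 2 of 6 are NA -> mean of the other 4
sparse <- c(rep(NA, 9), 3)            # 90% NA -> rejected by both

r1 <- c(meanIfThresh(x),  meanIfThresh(sparse))
r2 <- c(meanIfThresh2(x), meanIfThresh2(sparse))
all.equal(r1, r2)
```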

Edit: I ran a similar benchmark to see the effect of this change:

> microbenchmark(  "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
+                 , "meanIfThresh2"   = apply(SAMP, 1, meanIfThresh2)
+                 , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+                 , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15

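You should run the same benchmark. I don't think this will change anything. In my opinion, @thelatemail is right. - Arun
OK, I added the benchmark to my post - it looks like storing the NAs gains some speed. At least it seems to be as fast as the regular mean function. - Daniel Fischer
I should have said that it won't change much. Please check the edit in latemail's answer. - Arun
Yes, your suggestion to use rowMeans makes a huge difference! Before that, though, I thought that solution had to be slower, because it first computes the regular mean and only then does the is.na step. I didn't know rowMeans was that fast... - Daniel Fischer
@Daniel, this does help! It cuts down on the extra overhead of calling "mean". - Ricardo Saporta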
