使用apply()或lapply()等方法加速for循环

5

我编写了一个特殊的“impute”函数,根据特定列名,用mean()或mode()替换具有缺失(NA)值的列值。

输入数据框有400,000+行,速度非常慢,如何使用lapply()或apply()加速填充部分。

这是函数,我想要优化的部分用START OPTIMIZE和END OPTIMIZE标记:

specialImpute <- function(inputDF) 
{

  discoveredDf <- data.frame(STUDYID_SUBJID=character(), stringsAsFactors=FALSE)
  dfList <- list()
  counter = 1; 

  Whilecounter = nrow(inputDF)
  #for testing just do 10 iterations,i = 10;

  while (Whilecounter >0)
  {

    studyid_subjid=inputDF[Whilecounter,"STUDYID_SUBJID"]

    vect = which(discoveredDf$STUDYID_SUBJID == studyid_subjid)
    #was discovered and subset before 
    if (!is.null(vect))
    {
      #not subset before 
      if (length(vect)<1)
      {
      #subset the dataframe base on regex inputDF$STUDYID_SUBJID
    df <- subset(inputDF, regexpr(studyid_subjid, inputDF$STUDYID_SUBJID) > 0)

      #START OPTIMIZE
      for (i in nrow(df))
      {
      #impute , add column mean & add to list

      #apply(df[,c("y1","y2","y3","etc..")],2,function(x){x[is.na(x)] =mean(x, na.rm=TRUE)})

      if (is.na(df[i,"y1"])) {df[i,"y1"] = mean(df[,"y1"], na.rm = TRUE)}
      if (is.na(df[i,"y2"])) {df[i,"y2"] =mean(df[,"y2"], na.rm = TRUE)}
      if (is.na(df[i,"y3"])) {df[i,"y3"] =mean(df[,"y3"], na.rm = TRUE)}
      #impute using mean for CONTINUOUS variables
        if (is.na(df[i,"COVAR_CONTINUOUS_2"])) {df[i,"COVAR_CONTINUOUS_2"] =mean(df[,"COVAR_CONTINUOUS_2"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_3"])) {df[i,"COVAR_CONTINUOUS_3"] =mean(df[,"COVAR_CONTINUOUS_3"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_4"])) {df[i,"COVAR_CONTINUOUS_4"] =mean(df[,"COVAR_CONTINUOUS_4"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_5"])) {df[i,"COVAR_CONTINUOUS_5"] =mean(df[,"COVAR_CONTINUOUS_5"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_6"])) {df[i,"COVAR_CONTINUOUS_6"] =mean(df[,"COVAR_CONTINUOUS_6"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_7"])) {df[i,"COVAR_CONTINUOUS_7"] =mean(df[,"COVAR_CONTINUOUS_7"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_10"])) {df[i,"COVAR_CONTINUOUS_10"] =mean(df[,"COVAR_CONTINUOUS_10"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_14"])) {df[i,"COVAR_CONTINUOUS_14"] =mean(df[,"COVAR_CONTINUOUS_14"], na.rm = TRUE)}
        if (is.na(df[i,"COVAR_CONTINUOUS_30"])) {df[i,"COVAR_CONTINUOUS_30"] =mean(df[,"COVAR_CONTINUOUS_30"], na.rm = TRUE)}
      #impute using mode ordinal & nominal values
        if (is.na(df[i,"COVAR_ORDINAL_1"]))  {df[i,"COVAR_ORDINAL_1"] =Mode(df[,"COVAR_ORDINAL_1"])}
        if (is.na(df[i,"COVAR_ORDINAL_2"]))  {df[i,"COVAR_ORDINAL_2"] =Mode(df[,"COVAR_ORDINAL_2"])}
        if (is.na(df[i,"COVAR_ORDINAL_3"]))  {df[i,"COVAR_ORDINAL_3"] =Mode(df[,"COVAR_ORDINAL_3"])}
        if (is.na(df[i,"COVAR_ORDINAL_4"]))  {df[i,"COVAR_ORDINAL_4"] =Mode(df[,"COVAR_ORDINAL_4"])}
      #nominal 
        if (is.na(df[i,"COVAR_NOMINAL_1"]))  {df[i,"COVAR_NOMINAL_1"] =Mode(df[,"COVAR_NOMINAL_1"])}
        if (is.na(df[i,"COVAR_NOMINAL_2"]))  {df[i,"COVAR_NOMINAL_2"] =Mode(df[,"COVAR_NOMINAL_2"])}
        if (is.na(df[i,"COVAR_NOMINAL_3"]))  {df[i,"COVAR_NOMINAL_3"] =Mode(df[,"COVAR_NOMINAL_3"])}
        if (is.na(df[i,"COVAR_NOMINAL_4"]))  {df[i,"COVAR_NOMINAL_4"] =Mode(df[,"COVAR_NOMINAL_4"])}
        if (is.na(df[i,"COVAR_NOMINAL_5"]))  {df[i,"COVAR_NOMINAL_5"] =Mode(df[,"COVAR_NOMINAL_5"])}
        if (is.na(df[i,"COVAR_NOMINAL_6"]))  {df[i,"COVAR_NOMINAL_6"] =Mode(df[,"COVAR_NOMINAL_6"])}
        if (is.na(df[i,"COVAR_NOMINAL_7"]))  {df[i,"COVAR_NOMINAL_7"] =Mode(df[,"COVAR_NOMINAL_7"])}
        if (is.na(df[i,"COVAR_NOMINAL_8"]))  {df[i,"COVAR_NOMINAL_8"] =Mode(df[,"COVAR_NOMINAL_8"])}

      }#for
      #END OPTIMIZE

      dfList[[counter]] <- df 
      #add to discoveredDf since already substed
      discoveredDf[nrow(discoveredDf)+1,]<- c(studyid_subjid)
      counter = counter +1;
      #for debugging to check progress
        if (counter %% 100 == 0)
        {
        print(counter)
        }
      }
    }


    Whilecounter  = Whilecounter  -1;
  }#end while
  return (dfList)

}

谢谢


1
if (is.na(df[i, "y1"] ... 移到循环外面,作为 df[is.na(df$y1), "y1"] = mean(df$y1, na.rm=TRUE)。这样可以将迭代向量化,每列只有一个 R 调用,而不是 nrow(df) 次调用。对循环中的所有行重复此操作。 - Martin Morgan
3个回答

13
很可能可以通过对每个使用向量化函数来提高性能。目前,您正在迭代每一行,然后单独处理每个列,这会使速度变慢。另一个改进是通用代码,使您不必为每个变量输入新行。在我下面给出的示例中,这已处理,因为连续变量是数值型,而分类变量是因素。
要直接得出答案,您可以将代码替换为以下优化代码(但需要修复变量名称),前提是您的数字变量是数值或序数/分类变量不是(例如,因素):
impute <- function(x) {
  if (is.numeric(x)) {  # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {                # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Correct cols_to_impute with names of your variables to be imputed
# e.g., c("COVAR_CONTINUOUS_2", "COVAR_NOMINAL_3", ...)  
cols_to_impute <- names(df) %in% c("names", "of", "columns")
library(purrr)
df[, cols_to_impute] <- dmap(df[, cols_to_impute], impute)

以下是五种方法的详细比较:
  • 使用 for 循环迭代行,然后单独处理每一列。
  • 使用 for 循环。
  • 使用 lapply() 函数。
  • 使用 sapply() 函数。
  • 使用来自 purrr 包的 dmap() 函数。
新的方法都通过列迭代数据框,并利用一个名为 impute 的向量化函数,在向量中用平均值(如果是数字)或众数(否则)填充缺失值。除了 sapply() 函数外(您将看到原因),它们的差异相对较小,但有趣可供检查。
以下是我们将使用的实用程序函数:
# Function to simulate a data frame of numeric and factor variables with
# missing values and `n` rows
create_dat <- function(n) {
  set.seed(13)
  data.frame(
    con_1 = sample(c(10:20, NA), n, replace = TRUE),   # continuous w/ missing
    con_2 = sample(c(20:30, NA), n, replace = TRUE),   # continuous w/ missing
    ord_1 = sample(c(letters, NA), n, replace = TRUE), # ordinal w/ missing
    ord_2 = sample(c(letters, NA), n, replace = TRUE)  # ordinal w/ missing
  )
}

# Function that imputes missing values in a vector with mean (if numeric) or
# mode (otherwise)
impute <- function(x) {
  if (is.numeric(x)) {  # If numeric, impute with mean
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {                # mode otherwise
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

现在,每种方法的包装函数:
# Original approach
func0 <- function(d) {
  for (i in 1:nrow(d)) {
    if (is.na(d[i, "con_1"])) d[i,"con_1"] <- mean(d[,"con_1"], na.rm = TRUE)

    if (is.na(d[i, "con_2"])) d[i,"con_2"] <- mean(d[,"con_2"], na.rm = TRUE)

    if (is.na(d[i,"ord_1"])) d[i,"ord_1"] <- names(which.max(table(d[,"ord_1"])))

    if (is.na(d[i,"ord_2"])) d[i,"ord_2"] <- names(which.max(table(d[,"ord_2"])))
  }
  return(d)
}

# for loop operates directly on d
func1 <- function(d) {
  for(i in seq_along(d)) {
    d[[i]] <- impute(d[[i]])
  }
  return(d)
}

# Use lapply()
func2 <- function(d) {
  lapply(d, function(col) {
    impute(col)
  })
}

# Use sapply()
func3 <- function(d) {
  sapply(d, function(col) {
    impute(col)
  })
}

# Use purrr::dmap()
func4 <- function(d) {
  purrr::dmap(d, impute)
}

现在,我们将比较这些方法的性能,其中 n 的范围从 10 到 100(非常小):

library(microbenchmark)
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    ORIGINAL = func0(dat),
    FOR_LOOP = func1(dat),
    LAPPLY   = func2(dat),
    SAPPLY   = func3(dat),
    DMAP     = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))

# Plot the results
library(tidyr)
library(ggplot2)

times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()

在这里输入图片描述

很明显,原始方法比每列使用向量化函数impute的新方法要慢得多。那么新方法之间有什么区别呢? 让我们增加样本量来检查:

ns <- seq(5000, 50000, by = 5000)
times <- sapply(ns, function(n) {
  dat <- create_dat(n)
  op <- microbenchmark(
    FOR_LOOP = func1(dat),
    LAPPLY   = func2(dat),
    SAPPLY   = func3(dat),
    DMAP     = func4(dat)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()

在此输入图片描述

看起来sapply()并不是很好(如@Martin所指出的)。这是因为sapply()正在做额外的工作,以将我们的数据变成矩阵形式(我们不需要)。如果您自己运行此操作而没有使用sapply(),则会发现其余方法都相当可比。

因此,主要的性能改进是在每个列上使用向量化函数。我一开始建议使用dmap,因为我喜欢函数样式和一般的purrr包,但您可以放心地替换为您喜欢的任何方法。

另外,非常感谢@Martin提供了非常有用的评论,让我改进了这个答案!


2
好答案。成本在于使用sapply()函数;与lapply()函数进行比较(for循环修改了d,最好编写带有d作为参数的函数,然后运行system.time(f0(d)),还可以参考microbenchmark包)。ifelse()函数并不适用于测试标量条件(两个结果都会被评估),所以请使用一个简单的if () else语句。对于l/sapply方法的警告是有意义的 - 两种方法的结果不相同,请使用names(which.max(table(col)))。总的来说,得到正确的答案比得到快速的答案更重要,即identical(f0(d), f1(d)) - Martin Morgan
@SimonJackson 不需要使用 as.data.frame 来进行 lapply。可以用 'dmap' 来写,如下所示:df[, cols_to_impute] <- lapply(df[, cols_to_impute], impute) - Gregory Demin
@GregoryDemin 不错的观点,对于sapply()也适用,是吧?我进行了编辑。 - Simon Jackson
很棒的答案。我也喜欢purrr语法。你有什么想法为什么它在最小的_n_上表现一般般? - aurelien

1
如果你要处理类似矩阵的数据,建议使用矩阵而不是数据框,因为像访问矩阵一样访问数据框会非常耗时。你可以将数值提取到矩阵中进行部分计算。这可以显著提高速度。

1

以下是使用data.table的非常简单快速的解决方案。

library(data.table)

# name of columns
cols <- c("a", "c")

# impute date
setDT(dt)[, (cols) := lapply(.SD, function(x) ifelse( is.na(x) & is.numeric(x), mean(x, na.rm = T),
                                               ifelse( is.na(x) & is.character(x), names(which.max(table(x))), x)))  , .SDcols = cols ]

我还没有将这个解决方案与@Simon Jackson提供的解决方案进行性能比较,但这应该非常快。

可重现示例中的数据

set.seed(25)
dt <- data.table(a=c(1:5,NA,NA,1,1), 
                 b=sample(1:15, 9, replace=TRUE), 
                 c=LETTERS[c(1:6,NA,NA,1)])

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接