Replacing impossible values with NA using R's data.table

Question

3

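I have a piece of code that replaces values that cannot exist in a dataset with NA.

I am trying to convert my code to be data.table based, for example, replacing a height of 0 with a height of NA.

(Dummy) data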

DT <- data.table(id = 1:5e6,
                 height = sample(c(0, 100:240), 5e6, replace = TRUE))

My current solution is slower than the data.frame version, and at least as verbose. I guess I am doing something wrong...

DT[height == 0, height := NA]

While researching this, I found another solution that is faster (but uglier).

set(DT, which("height"==0), "height", value = NA)

All suggestions are welcome.

- s_baldur

1

Is DT[height == 0, height := NA] slow? - David Arenburg

The solution you found does not work for me. - Pierre L

1

@PierreLafortune It should probably be set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA), I guess. - David Arenburg

Faster than the others @DavidArenburg - Pierre L

Thanks @DavidArenburg. - Arun

1

@PierreLafortune My mistake in the "faster (but uglier)" snippet. The solution I have actually been using is set(DT, which(DT$height==0), "height", value = NA). - s_baldur

3 Answers

6

A single timing run on 100 million rows:

library(data.table)
DT <- data.table(id = 1:1e8, 
                 height = sample(c(0, 100:240), 1e8, replace = TRUE))
DT2 <- copy(DT);DT3 <- copy(DT); DT4 <- copy(DT); DT5 <- copy(DT); DT6 <- copy(DT);DT7 <- copy(DT)
library(microbenchmark)
microbenchmark(
  David    = set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA),
  OP       = DT2[height == 0, height := NA],
  akrun    = setkey(DT3, "height")[.(0), height := NA],
  isna     = {is.na(DT4$height) <- DT4$height == 0},
  assignNA = {DT5$height[DT5$height == 0] <- NA},
  indexset = {setindex(DT6, height); DT6[height==0, height := NA_real_]},
  exponent = DT7[, height:= NA^(!height)*height],
  times=1L
)
# Unit: milliseconds
# expr            min         lq       mean     median         uq        max neval
# David      585.9044   585.9044   585.9044   585.9044   585.9044   585.9044     1
# OP       10421.3323 10421.3323 10421.3323 10421.3323 10421.3323 10421.3323     1
# akrun    11922.5951 11922.5951 11922.5951 11922.5951 11922.5951 11922.5951     1
# isna      4843.3623  4843.3623  4843.3623  4843.3623  4843.3623  4843.3623     1
# assignNA  4797.0191  4797.0191  4797.0191  4797.0191  4797.0191  4797.0191     1
# indexset  6307.4564  6307.4564  6307.4564  6307.4564  6307.4564  6307.4564     1
# exponent  1054.6013  1054.6013  1054.6013  1054.6013  1054.6013  1054.6013     1

- Pierre L

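The assignment happens in the first run, so unless you recreate the dataset for every run, I am not sure how this gives an unbiased benchmark for the subsequent runs. - akrun

Each expression has its own separate table; they are not connected to each other. - Pierre L

@Pierre But you used 10 evals. - Frank

I noticed that, but my point is that after the first run the assignment has already happened. - akrun

We could use replicate(100, {build DT; set na expression}) to build the table and run the expression many times, so that the setup overhead is the same for every run. - Pierre L

If you have enough memory, that is an interesting idea. Also, I guess you would have to write the summary function yourself (rather than relying on microbenchmark). - Frank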
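
The replicate() idea from the comments above could be sketched roughly as follows. This is only a minimal sketch, not the benchmark that produced the numbers in this answer: the helper name time_fresh, the reduced row count and the number of replications are assumptions chosen so that it runs quickly. Each replication rebuilds the table, so every timed call sees data that still contains zeros.

```r
library(data.table)

# Hypothetical helper: rebuild the table, then time one replacement on it.
time_fresh <- function(replace_fun, n = 1e5) {
  DT <- data.table(id = seq_len(n),
                   height = sample(c(0, 100:240), n, replace = TRUE))
  # Only the replacement is timed, not the construction of the table.
  system.time(replace_fun(DT))[["elapsed"]]
}

set.seed(24)
timings <- list(
  set_which = replicate(5, time_fresh(function(DT)
    set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))),
  subset_assign = replicate(5, time_fresh(function(DT)
    DT[height == 0, height := NA]))
)

# Hand-written summary, since microbenchmark is not used here.
sapply(timings, median)
```

Because the setup cost is paid outside system.time(), the medians compare only the replacement step; with tables this small the differences may be within timing noise.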

5

We can try:

system.time(DT[, height:= NA^(!height)*height])
#  user  system elapsed 
#  0.03    0.05    0.08

The OP's code

system.time(DT[height == 0, height := NA])
#   user  system elapsed 
#   0.42    0.04    0.49

The base R options should be faster.

system.time(DT$height[DT$height == 0] <- NA)
#   user  system elapsed 
#  0.19    0.05    0.23

And using the is.na approach

system.time(is.na(DT$height) <- DT$height == 0)
#  user  system elapsed 
#   0.22    0.06    0.28

@DavidArenburg's suggestion

system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
#   user  system elapsed 
#   0.06    0.00    0.06

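Note: for all of these benchmarks the dataset was freshly created before each run, to keep the comparison unbiased. I could have used microbenchmark, but there would be some bias across runs, because the assignment already happens in the first run.

Using a larger dataset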

set.seed(24)
DT <- data.table(id = 1:1e8, 
             height = sample(c(0, 100:240), 1e8, replace = TRUE))

system.time(DT[, height:= NA^(!height)*height])
#  user  system elapsed 
#  0.58    0.24    0.81 

system.time(set(DT, i = which(DT[["height"]] == 0), j = "height", value = NA))
#   user  system elapsed 
#   0.49    0.12    0.61

Data

set.seed(24)
DT <- data.table(id = 1:1e7, 
             height = sample(c(0, 100:240), 1e7, replace = TRUE))

- akrun

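@snoram It will be faster on subsequent runs. - akrun

@akrun data.table sets a secondary key, so DT[height == 0, height := NA] will also be faster on subsequent runs. - David Arenburg

1

Also, the performance gain when height is an integer is quite interesting, which might help the OP. - Frank

@DavidArenburg I am benchmarking on the original dataset, i.e. recreating the data after each run. - akrun

1

@snoram !height gives a logical vector that is TRUE where height is 0 and FALSE otherwise. It could also be written as height == 0. NA^(!height) turns the TRUE values (i.e. the zeros) into NA and everything else into 1. Multiplying by height then returns NA wherever the factor is NA, and the original value wherever the factor is 1. - akrun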
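
The trick described in the comment above can be checked on a small vector in base R. This is an illustrative sketch; the vector and the names result and alt are made up for the example.

```r
height <- c(0, 150, 0, 200)

!height                  # logical: TRUE where height is 0
NA^(!height)             # NA where height is 0, 1 elsewhere (x^0 is always 1 in R)
result <- NA^(!height) * height
result

# The same replacement written more explicitly, for comparison.
alt <- replace(height, height == 0, NA)
identical(result, alt)
```

The exponent form avoids building an index vector, at the cost of being harder to read than the explicit version.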

- Arun · Accepted Answer

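Since v1.9.4, data.table by default automatically creates an index on a column when you subset with an expression of the form x == val or x %in% val inside a [.data.table call. This makes subsequent subsets very fast, at the price of a slightly more expensive first subset (only slightly, because data.table's radix sort is very fast). The first subset can be slower because it has to:

create the index,
and then perform the subset.

To illustrate (using @akrun's data):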

require(data.table)
getOption("datatable.auto.index") # [1] TRUE ===> enabled

set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.396   0.059   0.452 ## first run
#   0.003   0.000   0.004 ## second run is very fast

Now, if we disable auto indexing:

require(data.table)
options(datatable.auto.index = FALSE)
getOption("datatable.auto.index") # [1] FALSE

set.seed(24)
DT <- data.table(id = 1:1e7, height = sample(c(0, 100:240), 1e7, replace = TRUE))

system.time(DT[height == 0L])
#   0.037   0.007   0.042 ## first run
#   0.039   0.010   0.045 ## second run (~ 10x slower than 2nd run above)

options(datatable.auto.index = TRUE) # restore auto indexing if necessary

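But your case is special, because you update the same column that you subset on. Essentially, this is what happens:

data.table sees that the i expression can be optimised with auto indexing, so the index is created and saved for fast subsequent subsets.
It sees the j expression, and the column is updated.
The column the index was built on has been updated, so the index is removed.

The auto indexing logic should detect this case and skip creating the index altogether whenever any rows evaluate to TRUE, because the index it creates is basically useless.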
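
The life cycle of the index described above can be inspected with indices(), which lists the indices currently stored on a table. The following is only a sketch under assumptions: it uses a smaller table with an integer height column, and what it prints may depend on the data.table version, since the behaviour discussed here may have changed since this answer was written.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e5,
                 height = sample(c(0L, 100:240), 1e5, replace = TRUE))

# Make sure auto indexing is on for this sketch; the old setting is restored below.
old <- options(datatable.auto.index = TRUE)

invisible(DT[height == 0L])     # plain subset: an index on height may be created
indices(DT)                     # list the indices currently stored on DT

DT[height == 0L, height := NA]  # update the same column that was subset on
indices(DT)                     # inspect again after the update

options(old)
```

If the second call to indices() no longer lists height, the time spent building the index before the update was wasted, which is the point made above.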

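Please file an issue on the project issues page; just linking to this SO Q should be enough.

To answer your question: disable auto indexing and run the subset. It should then take roughly the same time as using set().

The base R solution cannot be faster here, because it copies the entire column in order to update those entries. But that is because base R chooses to do so.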
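
One way to see the difference between updating by reference and copying is to compare the memory address of the column before and after each kind of update, using data.table's address(). This is a sketch on a small assumed table; the variable names are made up, and the second replacement targets the value 100 only so that there is something left to replace.

```r
library(data.table)

set.seed(24)
DT <- data.table(id = 1:1e5,
                 height = sample(c(0L, 100:240), 1e5, replace = TRUE))

addr_start <- address(DT$height)

# set() modifies the existing column vector by reference.
set(DT, i = which(DT[["height"]] == 0L), j = "height", value = NA)
addr_after_set <- address(DT$height)

# Base R style replacement on the same table, for comparison.
DT$height[DT$height == 100L] <- NA
addr_after_base <- address(DT$height)

c(start = addr_start, set = addr_after_set, base = addr_after_base)
```

The address should be unchanged after set(), while the base R replacement is expected to leave the column at a new address, reflecting the copy mentioned above.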