In data.table, the grouping column has length 1 inside "j"


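While learning data.table, I ran into a situation that I cannot resolve elegantly.

Up front: the absurdity of the lm formula is acknowledged. I am trying to determine whether this nuance can be handled easily with a keyword or special operator within the data.table ecosystem.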

library(data.table)
mt <- as.data.table(mtcars)
mt[, list(model = list(lm(mpg ~ disp))), by = "cyl"]
#    cyl model
# 1:   6  <lm>
# 2:   4  <lm>
# 3:   8  <lm>
mt[, list(model = list(lm(mpg ~ disp + cyl))), by = "cyl"]
# Error in model.frame.default(formula = mpg ~ disp + cyl, drop.unused.levels = TRUE) : 
#   variable lengths differ (found for 'cyl')

This happens because, within this block, cyl is a length-1 vector rather than a full column like the other values:
mt[, list(model = { browser(); list(lm(mpg ~ cyl+disp)); }), by = "cyl"]
# Called from: `[.data.table`(mt, , list(model = {
#     browser()
#     list(lm(mpg ~ cyl + disp))
#   ...
# Browse[1]> 
# debug at #1: list(lm(mpg ~ cyl + disp))
# Browse[2]> 
disp
# [1] 160.0 160.0 258.0 225.0 167.6 167.6 145.0
# Browse[2]> 
cyl
# [1] 6
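
The same observation can be made without stepping through browser(). The following is a minimal sketch that compares the length of the grouping column with the length of an ordinary column in each group; the output column names (len_cyl, len_disp, n_rows) are made up for this illustration.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Inside j, the grouping column is a length-1 value, while the other
# columns carry one element per row of the current group.
# len_cyl, len_disp and n_rows are illustrative names only.
lens <- mt[, .(len_cyl = length(cyl),
               len_disp = length(disp),
               n_rows = .N),
           by = "cyl"]
print(lens)
```

Here len_disp should match n_rows in every group, while len_cyl stays at 1.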

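The simplest approach seems to be lengthening it by hand whenever it is needed, either through a temporary variable inside the block or inline: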

mt[, list(model = { cyl2 <- rep(cyl, nrow(.SD)); list(lm(mpg ~ cyl2+disp)); }), by = "cyl"]
mt[, list(model = list(lm(mpg ~ rep(cyl, nrow(.SD))+disp))), by = "cyl"]
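
A variation on the same workaround, offered as a sketch rather than a recommendation: data.table's .N symbol holds the row count of the current group, so it can stand in for nrow(.SD). The name cyl_full is hypothetical. The sketch also shows why the formula is acknowledged as absurd: a regressor that is constant within a group is collinear with the intercept, so lm() reports its coefficient as NA.

```r
library(data.table)

mt <- as.data.table(mtcars)

# .N is the number of rows in the current group, so rep(cyl, .N)
# rebuilds a full-length copy of the grouping value.
# `cyl_full` is a hypothetical name chosen for this example.
fits <- mt[, .(model = list({
  cyl_full <- rep(cyl, .N)
  lm(mpg ~ cyl_full + disp)
})), by = "cyl"]

# cyl_full is constant within each group, hence collinear with the
# intercept; lm() cannot estimate it and returns NA for that term.
cyl_coefs <- sapply(fits$model, function(m) coef(m)[["cyl_full"]])
print(cyl_coefs)
```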

Question: is there a more elegant way to handle this?


Various loosely related questions sparked my curiosity (about embedding "things" in DT objects):


There are several candidates so far, all of them good:

mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
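
To see why the renaming and .SDcols candidates work, here is a small sketch that inspects what is visible inside j in each case. The result names (renamed, in_sd, len_cyl, n_rows, cyl_in_sd) are invented for this example.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Grouping under a different name (`cylgroup`) leaves `cyl` as an
# ordinary column, so it is full-length inside j.
renamed <- mt[, .(len_cyl = length(cyl), n_rows = .N),
              by = .(cylgroup = cyl)]
print(renamed)

# Listing the grouping column in .SDcols is meant to make it part of
# .SD as well, which is what lets lm(..., .SD) find it.
in_sd <- mt[, .(cyl_in_sd = "cyl" %in% names(.SD)),
            by = cyl, .SDcols = names(mt)]
print(in_sd)
```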

Not much of an improvement, but to avoid creating the column inside j, you could make a duplicate "cyl" column beforehand and use that? - Mike H.
You can always change the name used in by= - mt[, length(cyl), by = .(cylgroup=cyl)] - thelatemail
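Also, a broader question: is this about general usage or specific to lm? I don't think it makes sense to pass a constant to lm (within each group, cyl or whatever you group by will be constant). - Mike H.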
One last thing - you can also bring cyl in through the .SDcols= argument - mt[, .(.(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)] or even mt[, .(.(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE] - thelatemail
"More elegant" is probably a matter of opinion. By the way, I think thela's approach is fine, but there is also the data= argument: mtm = mt[, list(model = list(lm(mpg ~ disp + cyl, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"] - Frank
1 Answer

Thanks to everyone for the candidates.
mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
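
Before timing the candidates, it may be worth confirming that they fit the same models. The sketch below compares per-group coefficients from two of them (the renamed by= and the .SDcols variant). The coefficients are extracted inside j so that nothing refers back to .SD afterwards; the names by_rename, by_sdcols and same are hypothetical.

```r
library(data.table)

mt <- as.data.table(mtcars)

# Candidate with the grouping column renamed in by=.
by_rename <- mt[, .(coefs = list(coef(lm(mpg ~ cyl + disp)))),
                by = .(cylgroup = cyl)]

# Candidate that exposes cyl through .SDcols and passes .SD as data.
by_sdcols <- mt[, .(coefs = list(coef(lm(mpg ~ cyl + disp, .SD)))),
                by = cyl, .SDcols = names(mt)]

# Groups appear in the same order in both results, so the coefficient
# lists can be compared element by element.
same <- isTRUE(all.equal(by_rename$coefs, by_sdcols$coefs))
print(same)
```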

With this small model, there appear to be some minor performance differences:

library(microbenchmark)
microbenchmark(
  c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
  c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
  c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
  c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
#  expr    min      lq     mean  median      uq     max neval
#    c1 3.7328 4.21745 4.584591 4.43485 4.57465  9.8924   100
#    c2 2.6740 3.11295 3.244856 3.21655 3.28975  5.6725   100
#    c3 2.8219 3.30150 3.618646 3.46560 3.81250  6.8010   100
#    c4 2.9084 3.27070 3.620761 3.44120 3.86935  6.3447   100
#    c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931   100

With a larger data set:

mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
microbenchmark(
  c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
  c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
  c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
  c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
  c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
)
# Unit: milliseconds
#  expr     min       lq     mean  median       uq      max neval
#    c1 27.1635 30.54040 33.98210 32.2859 34.71505  76.5064   100
#    c2 23.9612 25.83105 28.97927 27.5059 30.02720  67.9793   100
#    c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742   100
#    c4 25.6469 27.84185 30.52403 29.8286 32.60805  37.8675   100
#    c5 29.2477 32.32465 35.67090 35.0291 37.90410  68.5017   100

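I would guess the relative performance stays similar on other data. A better verdict would probably require a wider range of data.
Judging by median run time alone, the top-ranked candidate (by only a slight margin) appears to be: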
mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
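
The ranking by median can also be read off programmatically instead of by eye. This is a rough sketch that times only two of the candidates; times = 20 is an arbitrary choice to keep it quick, and the names mb, timings and ranked are illustrative. Timings vary by machine, so the order it prints may differ from the tables above.

```r
library(data.table)
library(microbenchmark)

mt <- as.data.table(mtcars)

# times = 20 is an arbitrary, small repetition count for this sketch.
mb <- microbenchmark(
  c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup = cyl)],
  c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by = cyl, .SDcols = names(mt)],
  times = 20
)

# summary() returns one row per expression; sort by the median column.
timings <- summary(mb)
ranked <- timings[order(timings$median), c("expr", "median")]
print(ranked)
```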
