我有一个数据表格,其中包含列a和列b。我已将其分成两部分:below
,其中b < .5,以及above
,其中b > .5:
DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]
我希望能够在above
和below
之间进行左外连接:对于above
中的每个a
,计算below
中的行数。这等同于在SQL中执行以下操作:
with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
above as (select a, b from dt where b > .5),
below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
a | count
---+-------
3 | 0
2 | 1
(2 rows)
我该如何使用data.tables实现同样的功能?目前我尝试了以下方法:
> key(below) = 'a'
> below[above, list(count=length(b))]
a count
[1,] 2 1
[2,] 3 1
[3,] 3 1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
a count b
[1,] 2 1 1
[2,] 3 NA 1
[3,] 3 NA 1
我需要更具体地说明一下,我已经尝试过使用merge
,但这会消耗我的系统内存(而且数据集仅占据我的内存约20%)。