dplyr 0.3 cannot inner join data.tables?

I have the following setup, with dplyr (0.3) and data.table (1.9.3) loaded.

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.3 dplyr_0.3       

loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1      magrittr_1.0.1 parallel_3.1.1 plyr_1.8.1     Rcpp_0.11.2   
[7] reshape2_1.4   stringr_0.6.2  tools_3.1.1 

Here are the datasets: two data.tables and two data.frames, with identical contents.

DT_1 = data.table(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
DT_2 = data.table(V1 = c("b","c"),foo = c(4,2))

DT_1_df = data.frame(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
DT_2_df = data.frame(V1 = c("b","c"),foo = c(4,2))

The data.table way

Inner joining the two data.tables the data.table way gives the following result:

setkey(DT_1, x); setkey(DT_2, V1)
DT_1[DT_2]
   x y v foo
1: b 1 4   4
2: b 3 5   4
3: b 6 6   4
4: c 1 7   2
5: c 3 8   2
6: c 6 9   2
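A side note on the join above, as a minimal sketch: `DT_1[DT_2]` keeps every row of `DT_2`, so it only coincides with an inner join here because both keys of `DT_2` occur in `DT_1`. Passing `nomatch = 0` drops unmatched rows. The table `DT_2_extra` and its extra key `"d"` are hypothetical, added only to make the difference visible.

```r
library(data.table)

DT_1 <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
# Hypothetical variant of DT_2 with a key ("d") that has no match in DT_1
DT_2_extra <- data.table(V1 = c("b", "c", "d"), foo = c(4, 2, 7))

setkey(DT_1, x)
setkey(DT_2_extra, V1)

# Keeps all rows of DT_2_extra; the unmatched key "d" appears with NA in y and v
all_of_right <- DT_1[DT_2_extra]

# nomatch = 0 drops keys without a match, which gives inner-join semantics
inner_only <- DT_1[DT_2_extra, nomatch = 0]
```

With the original `DT_2` both forms return the same six rows, which is why the plain `DT_1[DT_2]` is enough for this example.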

dplyr 0.3 inner_join on data.tables

Using dplyr's inner_join on the two data.tables produces an error:

inner_join(DT_1, DT_2, by=("x"="V1"))
Error in setkeyv(x, by$x) : some columns are not in the data.table: V1
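One possible reading of this particular error, offered as a guess rather than a confirmed diagnosis: the call above writes `by=("x"="V1")` without `c()`. In R, a parenthesised `"x" = "V1"` is an assignment whose value is `"V1"`, so `by` would arrive as the single column name `"V1"`, which matches the column named in the error message. The variable names below are illustrative only.

```r
# Without c(): the parentheses wrap an assignment, so the result is just "V1".
# Side effect: this also creates a variable named x in the current environment.
by_without_c <- ("x" = "V1")

# With c(): a named character vector mapping left column "x" to right column "V1"
by_with_c <- c("x" = "V1")

print(by_without_c)
print(by_with_c)
```

If this reading is right, the `c("x" = "V1")` form used in the later calls is the one that actually expresses a join on differently named columns.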

dplyr 0.3 inner join of a data.table and a data.frame

Joining a data.table with a data.frame produces a different error:

inner_join(DT_1, DT_2_df, by = c("x" = "V1"))
Error: Data table joins must be on same key

dplyr 0.3 inner join of data.frames

However, inner_join works fine on data.frames:

inner_join(DT_1_df, DT_2_df, by = c("x" = "V1"))
  x y v foo
1 b 1 4   4
2 b 3 5   4
3 b 6 6   4
4 c 1 7   2
5 c 3 8   2
6 c 6 9   2
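Since the data.frame method accepts differently named join columns, one possible stopgap is to convert the data.tables with `as.data.frame()` before joining. This is a sketch under that assumption; it gives up the data.table class (and keys) on the result, and the name `joined_df` is illustrative.

```r
library(dplyr)
library(data.table)

DT_1 <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT_2 <- data.table(V1 = c("b", "c"), foo = c(4, 2))

# Convert both inputs to plain data.frames so the data.frame join method is used
joined_df <- inner_join(as.data.frame(DT_1), as.data.frame(DT_2),
                        by = c("x" = "V1"))

# Convert back if a data.table is needed downstream
joined_dt <- as.data.table(joined_df)
```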

Can anyone explain why this happens?


Surely the most obvious explanation is that it's a bug? - hadley
I suspect it's a bug as well, unless it's an intended design, which seems unlikely. dplyr and data.table are both very useful packages. It would be great if their functions worked seamlessly on both data.frames and data.tables. Thank you! - KFB
1 Answer

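Posting the findings here for completeness.

After checking https://github.com/hadley/dplyr, it seems that dplyr's join currently has limited functionality. To quote: "Currently join variables must be the same in both the left and right hand sides." The test below seems to confirm this: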

library(dplyr); library(data.table)
DT_1 = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT_2 = data.table(V1=c("b","c"),foo=c(4,2)) # first column is named V1, unlike DT_1's x
DT_2b = data.table(x=c("b","c"),foo=c(4,2)) # first column is named x, matching DT_1

inner_join(DT_1, DT_2b, by= "x")
Source: local data table [6 x 4]
  x y v foo
1 b 1 4   4
2 b 3 5   4
3 b 6 6   4
4 c 1 7   2
5 c 3 8   2
6 c 6 9   2

inner_join(DT_1, DT_2, by = c("x" = "V1"))
Error: Data table joins must be on same key
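Given the test above, a possible workaround is to rename the join column on a copy of `DT_2` so that both sides share the name `x`, and then join with `by = "x"`. This is a minimal sketch using data.table's `copy()` and `setnames()`; the name `DT_2_renamed` is illustrative.

```r
library(dplyr)
library(data.table)

DT_1 <- data.table(x = rep(c("a", "b", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
DT_2 <- data.table(V1 = c("b", "c"), foo = c(4, 2))

# Work on a copy so the original DT_2 keeps its column name V1
DT_2_renamed <- copy(DT_2)
setnames(DT_2_renamed, "V1", "x")

# Both tables now share the join column name, as in the DT_2b case
joined <- inner_join(DT_1, DT_2_renamed, by = "x")
```

Renaming a copy leaves `DT_2` untouched, because `setnames()` modifies its argument by reference.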
