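Given two data tables (tbl_A and tbl_B), I would like to select all rows of tbl_A that have a matching row in tbl_B, and I would like the code to be expressive. If the %in% operator were defined for data.tables, something like this would be ideal: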
subset <- tbl_A[tbl_A %in% tbl_B]
I can think of many ways to get what I want, for example:
# double negation (set differences)
subset <- tbl_A[!tbl_A[!tbl_B,1,keyby=a]]
# nomatch with keyby and this annoying `[,V1:=NULL]` bit
subset <- tbl_B[,1,keyby=.(a=x)][,V1:=NULL][tbl_A,nomatch=0L]
# nomatch with !duplicated() and setnames()
subset <- tbl_B[!duplicated(tbl_B),.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# nomatch with unique() and setnames()
subset <- unique(tbl_B)[,.(x)][tbl_A,nomatch=0L]; setnames(subset,"x","a")
# use of a temporary variable (Thanks @Frank)
subset <- tbl_A[, found := FALSE][tbl_B, found := TRUE][(found)][,found:=NULL][]
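But each of these expressions is hard to read, and it is not obvious at first glance what the code is doing. Is there a more idiomatic/expressive way to do this?
For illustration, here are some toy data tables: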
# toy tables
library(data.table)
tbl_A <- data.table(a=letters[1:5],
                    b=1:5,
                    c=rnorm(5))
tbl_B <- data.table(x=letters[3:7],
                    y=13:17,
                    z=rnorm(5))
# both tables might have multiple rows with the same key fields.
tbl_A <- rbind(tbl_A,tbl_A)
tbl_B <- rbind(tbl_B,tbl_B)
setkey(tbl_A,a)
setkey(tbl_B,x)
The desired result contains the rows of tbl_A that match at least one row in tbl_B:
   a b          c
1: c 3 -0.5403072
2: c 3 -0.5403072
3: d 4 -1.3353621
4: d 4 -1.3353621
5: e 5  1.1811730
6: e 5  1.1811730
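For reference, the result described above is what is usually called a semi-join. Below is a minimal sketch of one way to wrap it in a readable helper. It assumes a data.table version that supports the `on=` argument; `semi_join_dt` is a hypothetical name for illustration, not a data.table function, and `set.seed(1)` is added only to make the toy data reproducible.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the rows of `x` whose
# join columns match at least one row of `y`. `on` is passed to the join,
# e.g. c(a = "x") matches column a of `x` against column x of `y`.
semi_join_dt <- function(x, y, on) {
  # which = TRUE returns the row numbers of `x` matched by each row of `y`;
  # nomatch = 0L drops rows of `y` that have no match in `x`.
  idx <- x[y, on = on, which = TRUE, nomatch = 0L]
  # `y` may hold duplicate keys, so de-duplicate the row numbers,
  # and sort them to preserve the row order of `x`.
  x[sort(unique(idx))]
}

# Toy tables mirroring the ones in the question.
set.seed(1)
tbl_A <- data.table(a = letters[1:5], b = 1:5, c = rnorm(5))
tbl_B <- data.table(x = letters[3:7], y = 13:17, z = rnorm(5))
tbl_A <- rbind(tbl_A, tbl_A)
tbl_B <- rbind(tbl_B, tbl_B)
setkey(tbl_A, a)
setkey(tbl_B, x)

result <- semi_join_dt(tbl_A, tbl_B, on = c(a = "x"))
# result$a is c, c, d, d, e, e; the columns are a, b, c only
```

When the match is on a single column, as in the toy data, `tbl_A[a %in% tbl_B$x]` selects the same rows; the helper is only a sketch for the case where the match spans several columns.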
tbl_A[, found := FALSE][tbl_B, found := TRUE]? By the way, there is an open issue/feature request for an %in% operator at https://github.com/Rdatatable/data.table/issues/2279. - Frank