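I want to apply a function to different subsets of the data in a data.table. The following example will hopefully illustrate what I am trying to achieve: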
library(data.table)
# generate data
set.seed(123)
(dt = data.table(id = 1:20,
                 grp = sample(letters[1:3], size = 20, replace = TRUE),
                 R = sample(255, size = 20),
                 G = sample(255, size = 20),
                 B = sample(255, size = 20)))
#> id grp R G B
#> 1: 1 c 137 7 141
#> 2: 2 c 221 137 210
#> 3: 3 c 99 169 97
#> 4: 4 b 72 74 249
#> 5: 5 c 26 23 91
#> 6: 6 b 7 155 153
#> 7: 7 b 170 188 38
#> 8: 8 b 255 53 21
#> 9: 9 c 211 135 207
#> 10: 10 a 164 248 41
#> 11: 11 b 78 250 175
#> 12: 12 b 81 224 90
#> 13: 13 a 43 166 60
#> 14: 14 b 103 217 223
#> 15: 15 c 117 34 16
#> 16: 16 a 76 221 116
#> 17: 17 c 143 69 94
#> 18: 18 c 32 72 6
#> 19: 19 a 234 76 235
#> 20: 20 a 109 63 200
Suppose I want to apply the following function to three columns ("R", "G" and "B") within each "grp" group. It takes 3 vectors of length n and returns a vector of length n.
fun = function(x1, x2, x3) {
  normalize = function(x) (x - min(x)) / diff(range(x))
  sqrt(normalize(x1)^2 + normalize(x2)^2 + normalize(x3)^2)
}
# mapping the column names of dt to the argument names of fun
vars = c(x1 = "R", x2 = "G", x3 = "B")
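To make the calling convention concrete, here is a minimal sketch on made-up toy vectors (the list `toy` and its values are purely illustrative and not part of my data). It relabels the selected columns with the argument names stored in vars and calls fun through do.call, so each column is matched to its argument by name:

```r
# fun and vars repeated here so the snippet runs on its own
fun = function(x1, x2, x3) {
  normalize = function(x) (x - min(x)) / diff(range(x))
  sqrt(normalize(x1)^2 + normalize(x2)^2 + normalize(x3)^2)
}
vars = c(x1 = "R", x2 = "G", x3 = "B")

# hypothetical toy input: three "columns" of length 3
toy = list(R = c(0, 5, 10), G = c(1, 2, 3), B = c(10, 20, 30))

# pick the columns named in vars and rename them to x1, x2, x3
args = setNames(toy[vars], names(vars))

# one output value per input row
res = do.call(fun, args)
length(res)
```

Each toy column normalizes to 0, 0.5, 1, so the result has the same length as the inputs, as required.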
The code below produces the output I want, but I am looking for a more efficient solution.
# solution, but very ugly and inefficient
dtgs = lapply(letters[1:3], function(g) {
  dtg = dt[grp == g, ]
  dtg[, value := do.call(fun, unname(as.list(dtg[, vars, with = FALSE])))]
})
rbindlist(dtgs)
#> id grp R G B value
#> 1: 10 a 164 248 41 1.1837788
#> 2: 13 a 43 166 60 0.5653052
#> 3: 16 a 76 221 116 0.9532667
#> 4: 19 a 234 76 235 1.4159583
#> 5: 20 a 109 63 200 0.8894540
#> 6: 4 b 72 74 249 1.0392584
#> 7: 6 b 7 155 153 0.7766996
#> 8: 7 b 170 188 38 0.9524469
#> 9: 8 b 255 53 21 1.0000000
#> 10: 11 b 78 250 175 1.2402336
#> 11: 12 b 81 224 90 0.9664781
#> 12: 14 b 103 217 223 1.2758577
#> 13: 1 c 137 7 141 0.8729010
#> 14: 2 c 221 137 210 1.6260248
#> 15: 3 c 99 169 97 1.1572081
#> 16: 5 c 26 23 91 0.4282122
#> 17: 9 c 211 135 207 1.5796092
#> 18: 15 c 117 34 16 0.4979543
#> 19: 17 c 143 69 94 0.8321982
#> 20: 18 c 32 72 6 0.4024126
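A possible direction that avoids the explicit loop, sketched here under assumptions and not benchmarked: data.table's `by` together with `.SD` and `.SDcols` hands the function one subset per group. The sketch assumes the order of vars matches the argument order of fun, because the columns are passed positionally.

```r
library(data.table)

# same data, function, and mapping as above, repeated so the snippet runs on its own
set.seed(123)
dt = data.table(id = 1:20,
                grp = sample(letters[1:3], size = 20, replace = TRUE),
                R = sample(255, size = 20),
                G = sample(255, size = 20),
                B = sample(255, size = 20))

fun = function(x1, x2, x3) {
  normalize = function(x) (x - min(x)) / diff(range(x))
  sqrt(normalize(x1)^2 + normalize(x2)^2 + normalize(x3)^2)
}
vars = c(x1 = "R", x2 = "G", x3 = "B")

# within each grp, .SD contains only the columns listed in .SDcols;
# they are passed to fun positionally (assumption: vars is in fun's argument order)
dt[, value := do.call(fun, unname(as.list(.SD))), by = grp, .SDcols = unname(vars)]
```

Unlike the rbindlist() version, this adds the value column to dt by reference and keeps the original row order instead of sorting the rows by group.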