Quantile cut by group in data.table

10

I want to do a quantile cut within each group, that is, split each group into n bins that hold an equal number of data points.

qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n+1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

library(data.table)
dt = data.table(A = 1:10, B = c(1,1,1,1,1,2,2,2,2,2))
dt[, bin := qcut(A, 3)]
dt[, bin2 := qcut(A, 3), by = B]

dt
     A B    bin        bin2
 1:  1 1  [1,4]    [6,7.33]
 2:  2 1  [1,4]    [6,7.33]
 3:  3 1  [1,4] (7.33,8.67]
 4:  4 1  [1,4]   (8.67,10]
 5:  5 1  (4,7]   (8.67,10]
 6:  6 2  (4,7]    [6,7.33]
 7:  7 2  (4,7]    [6,7.33]
 8:  8 2 (7,10] (7.33,8.67]
 9:  9 2 (7,10]   (8.67,10]
10: 10 2 (7,10]   (8.67,10]

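Without grouping, the cut is correct: every value lies inside its bin. The by-group result is wrong, though: the rows of group 1 (A from 1 to 5) are labelled with intervals that start at 6.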

How can I fix this?


2
dt[, qcut(A, 3), by = B] works fine. - Cath
1 Answer

8

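This is a bug in the handling of factors. Please check whether it is already known (or already fixed in the development version); if not, report it to the data.table bug tracker.

As a workaround, have qcut return a character vector instead of a factor: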

qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n+1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  as.character(cut(x, cutpoints, include.lowest = TRUE))
}

dt[, bin2 := qcut(A, 3), by = B]
#     A B    bin        bin2
# 1:  1 1  [1,4]    [1,2.33]
# 2:  2 1  [1,4]    [1,2.33]
# 3:  3 1  [1,4] (2.33,3.67]
# 4:  4 1  [1,4]    (3.67,5]
# 5:  5 1  (4,7]    (3.67,5]
# 6:  6 2  (4,7]    [6,7.33]
# 7:  7 2  (4,7]    [6,7.33]
# 8:  8 2 (7,10] (7.33,8.67]
# 9:  9 2 (7,10]   (8.67,10]
#10: 10 2 (7,10]   (8.67,10]
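
To see how by-group factor results can end up with wrong labels, here is a minimal base R sketch. It is only an illustration of the symptom, not data.table's actual internals: it assumes the per-group integer codes are kept while a single group's levels label the whole column (here the second group's, which matches the labels seen in the question's output). The names parts, codes, lev and mixed are made up for this example.

```r
# Same helper as in the question (base R only).
qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n + 1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

A = 1:10
B = rep(1:2, each = 5)

# One factor per group; each factor carries its own levels.
parts = lapply(split(A, B), qcut, n = 3)

# Keep only the integer codes of every group...
codes = unlist(lapply(parts, as.integer), use.names = FALSE)

# ...and label all of them with one group's levels.
# ASSUMPTION for illustration: group 2's levels are the ones that survive.
lev = levels(parts[["2"]])
mixed = factor(lev[codes], levels = lev)

print(lapply(parts, levels))            # the two groups have different levels
print(data.frame(A, B, codes, mixed))   # group 1 rows now show group 2 labels
```

Both groups produce the same codes (1, 1, 2, 3, 3), so once the levels of one group are applied to everything, the two halves of the column become indistinguishable.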

5
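Without changing the function, dt[, bin2 := as.character(qcut(A, 3)), by=B] works as well. If you try to convert it to a factor (dt[, bin2 := as.factor(as.character(qcut(A, 3))), by=B]), you get an error... - Cath
Yes, if you define a factor by group, the final column (the groups combined) will only take its attributes (such as levels) from group 1, I think. https://github.com/Rdatatable/data.table/issues/967 - Frank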
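
If a factor column is needed in the end, one possible approach (a sketch, not something confirmed in this thread) is to compute the labels as character by group and convert the whole column to a factor in a separate, ungrouped step, so the levels are built once from all groups. Another option is to store only the integer bin index via cut(..., labels = FALSE). The helper qcut_id and the column bin_id are hypothetical names introduced for this example.

```r
library(data.table)

# Helper from the question, returning a factor.
qcut = function(x, n) {
  quantiles = seq(0, 1, length.out = n + 1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE)
}

# Hypothetical variant: returns the integer bin index (1..n) instead of a factor.
qcut_id = function(x, n) {
  quantiles = seq(0, 1, length.out = n + 1)
  cutpoints = unname(quantile(x, quantiles, na.rm = TRUE))
  cut(x, cutpoints, include.lowest = TRUE, labels = FALSE)
}

dt = data.table(A = 1:10, B = rep(c(1, 2), each = 5))

# Step 1: by group, keep the labels as plain character strings.
dt[, bin2 := as.character(qcut(A, 3)), by = B]

# Step 2: without grouping, turn the whole column into a factor at once.
dt[, bin2 := factor(bin2)]

# Alternative: integer bin index per group, no factor levels involved.
dt[, bin_id := qcut_id(A, 3), by = B]

print(dt)
```

The order of the factor levels produced in step 2 follows the default sorting of the label strings, so pass explicit levels to factor() if a particular order matters.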
