Creating new groups when the original groups lack enough observations

Question

I have the following sample data:

library(data.table)
sample <- fread("
1,0,2,NA,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,0,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,35,cat Y, type 4
1,0,2,NA,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,NA,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,1,cat Y, type 4
1,0,2,4,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,2,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,2,cat Y, type 4
")

names(sample) <- c("A","B","C", "D", "cat", "type")

sample <- sample[, observations := sum(!is.na(D)), by = c("cat", "type")]

    A B C  D   cat   type observations
 1: 1 0 2 NA cat X type 1            1
 2: 3 4 3  1 cat X type 2            3
 3: 1 0 2  2 cat X type 3            3
 4: 3 4 3  0 cat X type 4            2
 5: 1 0 2 NA cat Y type 1            0
 6: 3 4 3 NA cat Y type 2            0
 7: 1 0 2  2 cat Y type 3            3
 8: 3 4 3 35 cat Y type 4            3
 9: 1 0 2 NA cat X type 1            1
10: 3 4 3  1 cat X type 2            3
...
24: 3 4 3  2 cat Y type 4            3

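I want to merge adjacent "type" groups when they have fewer than two observations.

For example: the "type 1" group of cat X, which has only 1 observation, is added to the observations of "type 2" (see the first row of the desired output).

Types need to be aggregated until every remaining group has at least 2 observations. So for "cat Y", "type 1" and "type 2" need to be merged with "type 3".

I have been trying to write code for this, but I got stuck.

Could anyone suggest a good way to create the new types automatically?

I realize there may be cases with two possible ways to aggregate the groups. However, as long as the groups being combined are adjacent (so "type 1" is not added to "type 4"), it does not matter which groups get combined.

Desired output: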

    A B C  D   cat   type  new_type observations
 1: 1 0 2 NA cat X type 1  type 2          4
 2: 3 4 3  1 cat X type 2  type 2          4
 3: 1 0 2  2 cat X type 3  type 3          3
 4: 3 4 3  0 cat X type 4  type 4          2
 5: 1 0 2 NA cat Y type 1  type 3          3
 6: 3 4 3 NA cat Y type 2  type 3          3
 7: 1 0 2  2 cat Y type 3  type 3          3
 8: 3 4 3 35 cat Y type 4  type 4          3
 9: 1 0 2 NA cat X type 1  type 2          4
10: 3 4 3  1 cat X type 2  type 2          4
...
24: 3 4 3  2 cat Y type 4  type 4          3

The solution does not have to use data.table.

- Tom

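Why are rows 3-4 separate in your desired output? You said groups should be merged with an adjacent group if they have fewer than two observations. - r2evans

@r2evans Sorry, that was my mistake; they are merged now. - Tom

@Waldi Fixed. My apologies, I kept overlooking the errors. I think I copied a wrong table at some point, which got me confused about the numbers. I have rechecked the whole table and it should be correct now. - Tom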

3 Answers

5

Perhaps you could create a helper function like this:

helper <- function(v) {
  s <- grp <- 0
  y <- vector("numeric", length(v))
  for (i in seq_along(v)) {
    y[i] <- grp
    s <- s + v[i]
    if (s >= 2) {
      s <- 0
      grp <- grp + 1
    }
  }
  y
}

Then run

dt <- sample[
  ,
  c(.(grp = helper(observations)), .SD),
  .(id = rleid(cat))
][
  ,
  `:=`(type = last(type), observations = sum(observations)),
  .(id, grp)
][, -(1:2)]

and you will get:

> dt
    A B C  D   type observations
 1: 1 0 2 NA type 2            4
 2: 3 4 3  1 type 2            4
 3: 1 0 2  2 type 3            3
 4: 3 4 3  0 type 4            2
 5: 1 0 2 NA type 3            3
 6: 3 4 3 NA type 3            3
 7: 1 0 2  2 type 3            3
 8: 3 4 3 35 type 4            3
 9: 1 0 2 NA type 2            4
10: 3 4 3  1 type 2            4
11: 1 0 2  2 type 3            3
12: 3 4 3 NA type 4            2
13: 1 0 2 NA type 3            3
14: 3 4 3 NA type 3            3
15: 1 0 2  2 type 3            3
16: 3 4 3  1 type 4            3
17: 1 0 2  4 type 2            4
18: 3 4 3  1 type 2            4
19: 1 0 2  2 type 3            3
20: 3 4 3  2 type 4            2
21: 1 0 2 NA type 3            3
22: 3 4 3 NA type 3            3
23: 1 0 2  2 type 3            3
24: 3 4 3  2 type 4            3
    A B C  D   type observations

- ThomasIsCoding

1

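Appending with idx <- c(idx, ....) is inefficient. Declare the vector up front with idx <- vector(, length(v)) and fill it inside the loop. - jangorecki

@jangorecki Thanks! That is a good suggestion for speeding things up. I have updated my solution! - ThomasIsCoding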

1

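If the values are integers, using double instead of integer in your helper function wastes some time: +1 vs +1L. - jangorecki

@jangorecki That sounds interesting! I did not know about the speed difference between integer and numeric; I have learned something new. - ThomasIsCoding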
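
For illustration, the integer suggestion might look like the sketch below. It assumes the counts are already integers (as produced by sum(!is.na(D))); the name helper_int is hypothetical and does not appear in the original answer.

```r
# Hypothetical integer-only variant of the helper from the answer above.
# It assumes v is an integer vector of observation counts.
helper_int <- function(v) {
  s <- 0L                   # running sum within the current group
  grp <- 0L                 # current group index
  y <- integer(length(v))   # preallocated result, as suggested in the comments
  for (i in seq_along(v)) {
    y[i] <- grp
    s <- s + v[i]
    if (s >= 2L) {          # enough observations: the next row starts a new group
      s <- 0L
      grp <- grp + 1L
    }
  }
  y
}

helper_int(c(1L, 3L, 3L, 2L))  # per-type counts for cat X: 0 0 1 2
helper_int(c(0L, 0L, 3L, 3L))  # per-type counts for cat Y: 0 0 0 1
```

Whether this is measurably faster on real data is not something the thread reports, so it is worth benchmarking before relying on it.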

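@ThomasIsCoding I ran into some problems when applying your solution to my real data. I managed to reproduce the problem and posted a new question: https://dev59.com/8nsPtIcB2Jgan1zn-T-j I used Waldi's answer as the example, but I get the same result with your function. If you have any ideas, I would be glad to hear them. - Tom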

3

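Here is a tidyverse solution. When a type has fewer than 2 observations, the highest type level (type 4) moves down one level and every other type moves up one level.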

library(dplyr)
sample %>% 
  mutate(
    new_type = as.numeric(factor(type)),
    new_type = paste0(
      "type ", 
      ifelse(observations<2,
             ifelse(new_type != max(new_type), new_type + 1, new_type - 1), 
             new_type)
      )
  ) %>% 
  group_by(cat, new_type) %>% 
  mutate(observations = sum(!is.na(D))) %>% 
  ungroup()

- Leon Samson

This does not match the desired output, because it never moves a type more than one level up or down; see row 5 for an example. - David Arenburg

- Waldi · Accepted Answer

Using Reduce with the accumulate = T option:

sample[,`:=`(type = last(type),observations=sum(observations)),
       .(cat,sapply(Reduce(f = function(x,y) {
                          grp= x$grp
                          if (x$nxtgrp) {grp=grp+1; x$cumsum=0}
                          nxtgrp=!((x$cumsum+y)<2)
                          list(grp = grp,
                               cumsum=x$cumsum + y,
                               nxtgrp = nxtgrp)},
                     x = observations,
                     init = list(grp = 0, cumsum=0, nxtgrp = F),
                     accumulate = T),
         function(x) x$grp)[-1])
       ][]

        A     B     C     D    cat   type observations
    <int> <int> <int> <int> <char> <char>        <int>
 1:     1     0     2    NA  cat X type 2            4
 2:     3     4     3     1  cat X type 2            4
 3:     1     0     2     2  cat X type 3            3
 4:     3     4     3     0  cat X type 4            2
 5:     1     0     2    NA  cat Y type 3            3
 6:     3     4     3    NA  cat Y type 3            3
 7:     1     0     2     2  cat Y type 3            3
 8:     3     4     3    35  cat Y type 4            3
 9:     1     0     2    NA  cat X type 2            4
10:     3     4     3     1  cat X type 2            4
11:     1     0     2     2  cat X type 3            3
12:     3     4     3    NA  cat X type 4            2
13:     1     0     2    NA  cat Y type 3            3
14:     3     4     3    NA  cat Y type 3            3
15:     1     0     2     2  cat Y type 3            3
16:     3     4     3     1  cat Y type 4            3
17:     1     0     2     4  cat X type 2            4
18:     3     4     3     1  cat X type 2            4
19:     1     0     2     2  cat X type 3            3
20:     3     4     3     2  cat X type 4            2
21:     1     0     2    NA  cat Y type 3            3
22:     3     4     3    NA  cat Y type 3            3
23:     1     0     2     2  cat Y type 3            3
24:     3     4     3     2  cat Y type 4            3
        A     B     C     D    cat   type observations

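The idea is to build an accumulated list that contains:

the current group: grp
the current running sum: cumsum
a flag saying that the group should be incremented on the next row: nxtgrp

As soon as the running number of observations reaches 2 or more, the flag to increment the group is set.
When the flag is set, on the next row cumsum is reset to zero and grp is incremented.

The grp list element can then be used in the by argument of data.table.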
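
To make the state transitions concrete, here is a small base-R sketch that traces the accumulated list for the cat Y counts (0, 0, 3, 3). The function name step and the local variable run are illustrative choices, not part of the original answer; the state fields follow the description above.

```r
# One accumulation step: takes the previous state and the next count.
step <- function(state, y) {
  grp <- state$grp
  run <- state$cumsum
  if (state$nxtgrp) {        # flag set on the previous row: start a new group
    grp <- grp + 1
    run <- 0
  }
  run <- run + y
  list(grp = grp, cumsum = run, nxtgrp = run >= 2)
}

obs <- c(0, 0, 3, 3)         # counts for cat Y, type 1 to type 4
states <- Reduce(step, obs,
                 init = list(grp = 0, cumsum = 0, nxtgrp = FALSE),
                 accumulate = TRUE)[-1]   # drop the initial state

trace <- data.frame(
  obs    = obs,
  grp    = sapply(states, function(s) s$grp),
  cumsum = sapply(states, function(s) s$cumsum),
  nxtgrp = sapply(states, function(s) s$nxtgrp)
)
print(trace)   # grp column: 0 0 0 1
```

Types 1 to 3 land in group 0 because the running sum only reaches 2 at type 3, and type 4 then opens group 1.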

Another way to get the same grouping is a for-loop function, written either in R or in Rcpp:

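observations_grp <- function(x) {
  cumsum_i <- 0     # running sum of observations in the current group
  nxtgrp <- FALSE   # TRUE when the next row should start a new group
  n <- length(x)
  grp <- rep(0, n)
  grp_i <- 0
  for (i in seq_len(n)) {   # seq_len() is safe when x is empty
    if (nxtgrp) {grp_i <- grp_i + 1; cumsum_i <- 0}
    nxtgrp <- !((cumsum_i + x[i]) < 2)
    cumsum_i <- cumsum_i + x[i]
    grp[i] <- grp_i
  }
  grp
}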

sample[,`:=`(type = last(type), observations=sum(observations)),
        .(cat,observations_grp(observations))
][]
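
One edge case is worth checking: a single forward pass can leave the last group short when the final counts never reach 2, for example with counts 3, 1. The sketch below shows one possible way to handle that by folding a short trailing group into the previous one. Both forward_grp and merge_short_tail are hypothetical names, and the post-processing step is an assumption about what might be wanted, not part of the original answer.

```r
# Forward pass with the same logic as the loop above:
# start a new group once the running sum has reached 2.
forward_grp <- function(x) {
  grp <- numeric(length(x))
  grp_i <- 0; run <- 0; nxtgrp <- FALSE
  for (i in seq_along(x)) {
    if (nxtgrp) { grp_i <- grp_i + 1; run <- 0 }
    run <- run + x[i]
    nxtgrp <- run >= 2
    grp[i] <- grp_i
  }
  grp
}

# Hypothetical post-processing: if the last group has fewer than min_obs
# observations, merge it into the group before it.
merge_short_tail <- function(x, grp, min_obs = 2) {
  if (length(grp) == 0) return(grp)
  last <- max(grp)
  if (last > 0 && sum(x[grp == last]) < min_obs) {
    grp[grp == last] <- last - 1
  }
  grp
}

x <- c(3, 1)
forward_grp(x)                       # 0 1: the last group holds only 1 observation
merge_short_tail(x, forward_grp(x))  # 0 0
```

The sample data in the question never triggers this case, since type 4 has at least 2 observations in both categories.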

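The performance comparison shows that Reduce is marginally slower than the R loop on this sample (median 1.56 ms versus 1.53 ms).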

Unit: milliseconds
   expr    min      lq     mean  median      uq    max neval
 Reduce 1.3458 1.45025 1.732185 1.56405 1.73740 6.3339   100
   Loop 1.3374 1.44175 1.685722 1.53120 1.67665 3.7091   100

If you need speed, Rcpp will certainly help a lot.
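
The answer does not include the Rcpp version. A minimal sketch of what it might look like is given below, assuming the Rcpp package and a working C++ toolchain are installed; the function name observations_grp_cpp is an assumption, and the logic simply mirrors the R loop above.

```r
library(Rcpp)

# Hypothetical C++ port of the R loop: same state variables, same threshold of 2.
cppFunction('
IntegerVector observations_grp_cpp(IntegerVector x) {
  int n = x.size();
  IntegerVector grp(n);
  int grp_i = 0;        // current group index
  int cumsum_i = 0;     // running sum within the current group
  bool nxtgrp = false;  // true when the next row should start a new group
  for (int i = 0; i < n; ++i) {
    if (nxtgrp) { ++grp_i; cumsum_i = 0; }
    cumsum_i += x[i];
    nxtgrp = cumsum_i >= 2;
    grp[i] = grp_i;
  }
  return grp;
}')

observations_grp_cpp(c(0L, 0L, 3L, 3L))  # 0 0 0 1
```

It could then replace observations_grp in the by argument of the data.table call; any speed-up should be measured on the real data rather than assumed.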