Grouping recurring data by period with data.table

6

I have a dataset with names, dates, and several categorical columns. Say:

data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
               period = c(1,1,1,1,1,1,2,2,2,3,3), 
               category = c("A","A","A","B","B","B","A","B","A","A","B"))

which looks like this:

  name  period  category
  Anne       1         A
   Ben       1         A
   Cal       1         A
  Anne       1         B
   Ben       1         B
   Cal       1         B
  Anne       2         A
   Ben       2         B
   Ben       2         A
   Ben       3         A
   Cal       3         B

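For each period and each category group, I want to count how many of the names were also present in the previous period. The output should look like this: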

period  category  recurrence_count
    2         A                 2   # due to Anne and Ben being on A, period 1
    2         B                 1   # due to Ben being on B, period 1
    3         A                 1   # due to Ben being on A, period 2 
    3         B                 0   # no match from B, period 2

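I am aware of the .I and .GRP operators in data.table, but I don't know how to express the notion of "the next group" in the j entry of the statement. I imagine something like this could be a reasonable path, but I can't come up with the correct syntax: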

data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
4 Answers

4
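You can first summarise the data by category and period, collecting the names seen in each group.
previous_period_names <- data[, .(names = list(name)), .(category, period)]

previous_period_names[, next_period := period + 1]

Join the summary back onto the original data. Joining on category as well as period keeps each group paired with its own category's previous period.

data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]

Now count how many names in each group appear among the previous period's names. Period 1 has no previous period, so its counts are 0.
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]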
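
The steps above can be put together as a minimal, self-contained sketch. It assumes the data as tabulated in the question (Ben on A and Cal on B in period 3), joins on category as well as period, and drops period 1 from the result. The names `dt`, `prev`, `prev_names`, and `res` are illustrative only.

```r
library(data.table)

# Data as tabulated in the question (assumption: period 3 has Ben on A, Cal on B).
dt <- data.table(
  name     = c("Anne", "Ben", "Cal", "Anne", "Ben", "Cal",
               "Anne", "Ben", "Ben", "Ben", "Cal"),
  period   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B")
)

# One row per (category, period) holding the names seen in that group,
# then shift the period so each row is keyed by the period that follows it.
prev <- dt[, .(prev_names = list(name)), by = .(category, period)]
prev[, period := period + 1]

# Update join: attach the previous period's names to every row.
dt[prev, prev_names := i.prev_names, on = .(category, period)]

# Count names that also appear in the previous period; period 1 has no
# predecessor, so it is filtered out here.
res <- dt[period > 1,
          .(recurrence_count = sum(name %in% unlist(prev_names))),
          by = .(period, category)]
print(res)
```

With the tabulated data this should reproduce the four rows requested in the question.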

2

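Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).

Join the original data with the subset that has a value for prev_period (data[data[!is.na(prev_period)], ...]). Join on category, period = prev_period, and name.

In the resulting data set, for each period and category (by = .(period = i.period, category)), count the names in the original data that match the previous period (length(na.omit(x.name))).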

data[period != 1, prev_period := period - 1]

data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
     .(category, i.period, x.name)][
       , .(n = length(na.omit(x.name))), by = .(period = i.period, category)]

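# Result when the last two category values are entered as "B", "A"
# (Ben on B and Cal on A in period 3); with the data as tabulated in the
# question, the period 3 counts are A = 1 and B = 0.
#    period category n
# 1:      2        A 2
# 2:      2        B 1
# 3:      3        B 1
# 4:      3        A 0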
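
The join step described above can also be inspected on its own. This is a minimal sketch, assuming the data as tabulated in the question (Ben on A and Cal on B in period 3); `dt`, `joined`, and `res` are illustrative names.

```r
library(data.table)

# Data as tabulated in the question (assumption: period 3 has Ben on A, Cal on B).
dt <- data.table(
  name     = c("Anne", "Ben", "Cal", "Anne", "Ben", "Cal",
               "Anne", "Ben", "Ben", "Ben", "Cal"),
  period   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  category = c("A", "A", "A", "B", "B", "B", "A", "B", "A", "A", "B")
)

dt[period != 1, prev_period := period - 1]

# One row per observation that has a previous period; x.name is NA when the
# same name was not in the same category one period earlier.
joined <- dt[dt[!is.na(prev_period)],
             on = c("category", period = "prev_period", "name"),
             .(category, i.period, x.name)]
print(joined)

# Count the non-missing matches per period and category.
res <- joined[, .(n = sum(!is.na(x.name))), by = .(period = i.period, category)]
print(res)
```

Printing `joined` shows which rows found a match, which makes the zero counts easy to trace back to the `NA` entries.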

1
Thank you, Hendrick! This is very clever, and memory-efficient too! - pheymanss

1

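One option in base R is to split 'data' by 'category', loop over the resulting list with lapply, split 'name' by 'period', and apply Reduce with intersect and accumulate = TRUE. Take the lengths of that list and build a data.frame together with the unique elements of 'period'. Then use Map with the names of the list output to add 'category', and bind the list of data frames into a single dataset. Note that the accumulated intersection keeps only names present in every period so far; for this data that gives the same counts as comparing with the previous period alone.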

library(data.table)
lst1 <- lapply(split(data, data$category), function(x) 
   data.frame(period = unique(x$period)[-1], 
   recurrence_count = lengths(Reduce(intersect, 
           split(x$name, x$period), accumulate = TRUE)[-1])))
rbindlist(Map(cbind, category = names(lst1), lst1))[
      order(period), .(period, category, recurrence_count)]
#     period category recurrence_count
#1:      2        A                2
#2:      2        B                1
#3:      3        A                1
#4:      3        B                0

Or use the same logic in data.table: group by 'category', split 'name' by 'period', and apply Reduce with intersect.

setDT(data)[, .(period = unique(period), 
    recurrence_count = lengths(Reduce(intersect, 
    split(name, period), accumulate = TRUE))), .(category)][duplicated(category)]
#   category period recurrence_count
#1:        A      2                2
#2:        A      3                1
#3:        B      2                1
#4:        B      3                0

Or a similar option with the tidyverse:

library(dplyr)
library(purrr)
data %>% 
   group_by(category) %>% 
   summarise(recurrence_count = lengths(accumulate(split(name, period),
        intersect)), period = unique(period), .groups = 'drop' ) %>%  
   filter(duplicated(category))
# A tibble: 4 x 3
#  category recurrence_count period
#  <chr>               <int>  <int>
#1 A                       2      2
#2 A                       1      3
#3 B                       1      2
#4 B                       0      3

Data

data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben", 
"Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A", 
"B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
row.names = c(NA, 
-11L))
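
The `Reduce(intersect, ..., accumulate = TRUE)` calls above build a running intersection, which is not always the same as comparing each period with the one immediately before it. The sketch below uses made-up names for a single category (`by_period` is hypothetical data, not taken from the question) to show a case where the two differ.

```r
# Hypothetical names per period for one category: "Cal" first appears in period 2.
by_period <- list(`1` = c("Anne", "Ben"),
                  `2` = c("Anne", "Ben", "Cal"),
                  `3` = c("Ben", "Cal"))

# Running intersection: names present in every period so far.
cumulative <- lengths(Reduce(intersect, by_period, accumulate = TRUE))[-1]

# Pairwise intersection: names shared with the immediately preceding period.
pairwise <- lengths(Map(intersect, head(by_period, -1), tail(by_period, -1)))

# Counts for periods 2 and 3 under each definition.
print(unname(cumulative))
print(unname(pairwise))
```

In period 3 the running intersection only finds Ben, while the pairwise comparison also counts Cal, who was present in period 2 but not in period 1. If "previous period" is meant literally, the pairwise form is the closer match.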

1
A data.table option
setDT(data)[
  ,
  {
    u <- split(name, period)
    data.table(
      period = unique(period)[-1],
      recurrence_count = lengths(
        Map(
          intersect,
          head(u, -1),
          tail(u, -1)
        )
      )
    )
  },
  category
]

gives

   category period recurrence_count
1:        A      2                2
2:        A      3                1
3:        B      2                1
4:        B      3                0
