How to compute frequencies/tables of a categorical variable by group with R data.table?

Question

3

I have a data.table created in R as follows:

library(data.table)
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2", ...), category = c("red", "red", "blue", "red", "red", "blue", "green", "green", ...))

dt
ID         category
person1    red
person1    red
person1    blue
person2    red
person2    red
person2    blue
person2    green
person2    green
person3    blue
....

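For each unique ID, I would like to compute the frequency of each value of the categorical variable (red, blue, green), and then spread those counts into separate columns. The resulting data.table should look like this: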

dt
ID        red    blue    green
person1   2      1       0
person2   2      1       2    
...

I mistakenly assumed that the right way to start with data.table was to compute table() by group, e.g.:

dt[, counts :=table(category), by=ID]

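However, this appears to count the total number of category values per group ID. It also doesn't solve my problem of "expanding" the data.table.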

What is the correct way to do this?

- ShanZhengYang

3 Answers

2

You can do this in one line with the reshape2 library.

library(reshape2)
dcast(data=dt,
      ID ~ category,
      fun.aggregate = length,
      value.var = "category")

       ID blue green red
1 person1    1     0   2
2 person2    1     2   2

Also, if you just need a simple two-way table, you can use R's built-in table function.

table(dt$ID,dt$category)
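
If the two-way table is needed as a data.table with an ID column rather than as a table object, one possible conversion is sketched below. It assumes the sample data from the question without the elided rows; the names tab and wide are illustrative only.

```r
library(data.table)

# Sample data from the question (elided rows left out)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Two-way contingency table: rows are IDs, columns are categories
tab <- table(dt$ID, dt$category)

# as.data.frame.matrix() keeps the wide layout (plain as.data.frame() would
# return one row per ID/category pair); keep.rownames moves the IDs from
# the row names into a regular column named "ID"
wide <- as.data.table(as.data.frame.matrix(tab), keep.rownames = "ID")
wide
```

Because table() sorts the category values, the columns come out in alphabetical order (blue, green, red), and combinations that never occur are already 0.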

- akaDrHouse

1

This is done in an imperative style; there is probably a cleaner, functional way to do it.

library(data.table)
library(dplyr)  # provides %>% and filter(), which the loop below uses
dt = data.table(ID = c("person1", "person1", "person1", "person2", "person2", "person2", "person2", "person2"), 
                category = c("red", "red", "blue", "red", "red", "blue", "green", "green"))


ids <- unique(dt$ID)
categories <- unique(dt$category)
counts <- matrix(nrow=length(ids), ncol=length(categories))
rownames(counts) <- ids
colnames(counts) <- categories

for (i in seq_along(ids)) {
  for (j in seq_along(categories)) {
    count <- dt %>%
      filter(ID == ids[i], category == categories[j]) %>%
      nrow()

    counts[i, j] <- count
  }
}

Then:

>counts
##         red blue green
##person1   2    1     0
##person2   2    1     2

- Julian Zucker

- amatsuo_net · Accepted Answer

Something like this?

library(data.table)
library(dplyr)
dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category)

If you want to add these columns to the original data.table:

counts <- dt[, .N, by = .(ID, category)] %>% dcast(ID ~ category) 
counts[is.na(counts)] <- 0
output <- merge(dt, counts, by = "ID")
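
The same steps can probably be done with data.table alone, without the pipe. The sketch below is a variation rather than the answer's exact code: it passes value.var and fill to dcast explicitly, so missing ID/category combinations come out as 0 and the separate NA replacement is not needed. It assumes the sample data from the question without the elided rows.

```r
library(data.table)

# Sample data from the question (elided rows left out)
dt <- data.table(
  ID = c("person1", "person1", "person1",
         "person2", "person2", "person2", "person2", "person2"),
  category = c("red", "red", "blue", "red", "red", "blue", "green", "green")
)

# Step 1: count rows per (ID, category) pair; .N is the group size
long_counts <- dt[, .N, by = .(ID, category)]

# Step 2: spread the categories into columns.
# value.var = "N" names the column to spread explicitly,
# fill = 0L writes 0 for combinations that never occur.
counts <- dcast(long_counts, ID ~ category, value.var = "N", fill = 0L)

# Step 3 (optional): attach the per-ID counts to every original row
output <- merge(dt, counts, by = "ID")
```

Here counts has one row per ID with columns blue, green and red, and output keeps all eight original rows with the three count columns appended.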