Count the unique values of one column by pairwise combinations of another column, grouped by a third column, in R

5

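This is honestly a fairly tricky task. It is basically an extension of a question I asked earlier: counting the unique values of one column by pairwise combinations of another column in R.

Suppose this time I have the following data frame in R: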

data.frame(Reg.ID = c(1,1,2,2,2,3,3), Location = c("X","X","Y","Y","Y","X","X"), Product = c("A","B","A","B","C","B","A"))

The data looks like this:
  Reg.ID Location Product
1      1        X       A
2      1        X       B
3      2        Y       A
4      2        Y       B
5      2        Y       C
6      3        X       B
7      3        X       A

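I want to count the unique values of the Reg.ID column for each pairwise combination of values in the Product column, grouped by the Location column. The result should look like this: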
  Location Prod.Comb Count
1        X       A,B     2
2        Y       A,B     1
3        Y       A,C     1
4        Y       B,C     1

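I tried to get this output with base R functions but had no success. I suspect there is a fairly simple solution using the data.table package in R?

Any help would be greatly appreciated. Thanks!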

2 Answers

6

This is just an idea with very little testing behind it, but it is the first thing that came to mind using data.table:

library(data.table)
dt <- data.table(Reg.ID = c(1,1,2,2,2,3,3), Location = c("X","X","Y","Y","Y","X","X"), Product = c("A","B","A","B","C","B","A"))
# Self-join within each Location so every row is paired with every row there
dt.cj <- merge(dt, dt, by = "Location", all = TRUE, allow.cartesian = TRUE)
# Keep each unordered product pair once, then count distinct Reg.IDs per pair
dt.res <- dt.cj[Product.x < Product.y, .(cnt = length(unique(Reg.ID.x))), by = .(Location, Product.x, Product.y)]


#    Location Product.x Product.y cnt
# 1:        X         A         B  2
# 2:        Y         A         B  1
# 3:        Y         A         C  1
# 4:        Y         B         C  1

4
A similar approach: dt[order(Product), CJ(Product, Product)[V1 < V2], by=.(Location, Reg.ID)][, .N, by=.(Location, V1, V2)]. I think CJ is similar to your Cartesian-product step. - Frank
It does not perform well on large data sets, but works fine on subsets. - sharmanas
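The approach in Frank's comment can be written out as a runnable sketch. Two things here are assumptions rather than part of the original thread: the CJ() columns are named explicitly as V1 and V2 (so the code does not rely on how CJ() auto-names its arguments in a given data.table version), and unique = TRUE is passed in case a Reg.ID lists the same product twice. Note that this version pairs products only within the same Reg.ID, whereas the merge on Location alone also pairs products across different Reg.IDs. Both give the same result on the sample data, but they may differ on other data, depending on which count is intended.

```r
library(data.table)

dt <- data.table(
  Reg.ID   = c(1, 1, 2, 2, 2, 3, 3),
  Location = c("X", "X", "Y", "Y", "Y", "X", "X"),
  Product  = c("A", "B", "A", "B", "C", "B", "A")
)

# Build product pairs inside each (Location, Reg.ID) group only.
# V1 < V2 keeps each unordered pair once and drops self-pairs.
pairs <- dt[, CJ(V1 = Product, V2 = Product, unique = TRUE)[V1 < V2],
            by = .(Location, Reg.ID)]

# Each Reg.ID contributes at most one row per pair, so .N is the number
# of distinct Reg.IDs for that Location and product pair.
res <- pairs[, .(Count = .N),
             by = .(Location, Prod.Comb = paste(V1, V2, sep = ","))]
setorder(res, Location, Prod.Comb)
print(res)
```

Because the cross join is built per Reg.ID rather than per Location, the intermediate table is likely to stay much smaller on large inputs, though that has not been benchmarked here.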

2

A dplyr-based solution, lifted from the question you mentioned:

library(dplyr)

df <- data.frame(Reg.ID = c(1,1,2,2,2,3,3), 
                 Location = c("X","X","Y","Y","Y","X","X"), 
                 Product = c("A","B","A","B","C","B","A"),
                 stringsAsFactors = FALSE)

df %>%
  full_join(df, by="Location") %>%
  filter(Product.x < Product.y) %>%
  group_by(Location, Product.x, Product.y) %>%
  summarise(Count = n_distinct(Reg.ID.x)) %>%
  mutate(Prod.Comb = paste(Product.x, Product.y, sep=",")) %>%
  ungroup() %>%
  select(Location, Prod.Comb, Count) %>%
  arrange(Location, Prod.Comb)

# # A tibble: 4 × 3
#   Location Prod.Comb Count
#      <chr>     <chr> <int>
# 1        X       A,B     2
# 2        Y       A,B     1
# 3        Y       A,C     1
# 4        Y       B,C     1
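Since the question mentions trying base R first, here is a minimal sketch without extra packages. It assumes the count should include a Reg.ID only when that Reg.ID itself carries both products of the pair. The helper name pairs_for_group is hypothetical and used only for illustration.

```r
df <- data.frame(Reg.ID = c(1, 1, 2, 2, 2, 3, 3),
                 Location = c("X", "X", "Y", "Y", "Y", "X", "X"),
                 Product = c("A", "B", "A", "B", "C", "B", "A"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: all sorted product pairs for one (Location, Reg.ID) group.
pairs_for_group <- function(g) {
  p <- sort(unique(g$Product))
  if (length(p) < 2) return(NULL)   # combn() needs at least two products
  cmb <- combn(p, 2)                # 2 x k matrix, one pair per column
  data.frame(Location = g$Location[1],
             Prod.Comb = paste(cmb[1, ], cmb[2, ], sep = ","),
             stringsAsFactors = FALSE)
}

# One data frame per (Location, Reg.ID) combination that actually occurs.
groups <- split(df, list(df$Location, df$Reg.ID), drop = TRUE)
pairs  <- do.call(rbind, Filter(Negate(is.null), lapply(groups, pairs_for_group)))

# Every row of `pairs` stands for one Reg.ID, so summing 1s counts Reg.IDs.
pairs$Count <- 1L
res <- aggregate(Count ~ Location + Prod.Comb, data = pairs, FUN = sum)
res <- res[order(res$Location, res$Prod.Comb), ]
rownames(res) <- NULL
print(res)
```

combn() enumerates each unordered pair once, so no Product.x < Product.y filter is needed afterwards.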
