Difference between spread and dcast

7

I have a table like this:

> head(dt2)
  Weight Height   Fitted interval limit    value
1   65.6  174.0 71.91200     pred   lwr 53.73165
2   80.7  193.5 91.63237     pred   lwr 73.33198
3   72.6  186.5 84.55326     pred   lwr 66.31751
4   78.8  187.2 85.26117     pred   lwr 67.02004
5   74.8  181.5 79.49675     pred   lwr 61.29244
6   86.4  184.0 82.02501     pred   lwr 63.80652

I want it to look like this:
> head(reshape2::dcast(dt2, 
         Weight + Height + Fitted + interval ~ limit, 
         fun.aggregate = mean))
  Weight Height   Fitted interval      lwr      upr
1   42.0  153.4 51.07920     conf 49.15463 53.00376
2   42.0  153.4 51.07920     pred 32.82122 69.33717
3   43.2  160.0 57.75378     conf 56.35240 59.15516
4   43.2  160.0 57.75378     pred 39.54352 75.96404
5   44.8  149.5 47.13512     conf 44.87642 49.39382
6   44.8  149.5 47.13512     pred 28.83891 65.43133

How can I do this with tidyr::spread?
I tried:
> tidyr::spread(dt2, limit, value)

but I got this error:

Error: Duplicate identifiers for rows (1052, 1056), (238, 242), (1209, 1218), (395, 404), (839, 1170), (25, 356), (1173, 1203, 1215), (359, 389, 401), (1001, 1200), (187, 386), (906, 907), (92, 93), (930, 1144), (116, 330), (958, 1171), (144, 357), (902, 1018), (88, 204), (960, 1008), (146, 194), (1459, 1463), (645, 649), (1616, 1625), (802, 811), (1246, 1577), (432, 763), (1580, 1610, 1622), (766, 796, 808), (1408, 1607), (594, 793), (1313, 1314), (499, 500), (1337, 1551), (523, 737), (1365, 1578), (551, 764), (1309, 1425), (495, 611), (1367, 1415), (553, 601)

A random sample of 10 rows:

> dt[sample(nrow(dt), 10), ]
     Weight Height   Fitted interval limit    value
1253   52.2  162.5 60.28203     conf   upr 61.51087
426    49.1  158.8 56.54022     pred   upr 74.75756
1117   78.4  184.5 82.53066     conf   lwr 80.98778
1171   85.9  166.4 64.22611     conf   lwr 63.21254
948    61.4  177.8 75.75494     conf   lwr 74.66393
384    90.9  172.7 70.59731     pred   lwr 52.41828
289    75.9  172.7 70.59731     pred   lwr 52.41828
3      44.8  149.5 47.13512     pred   lwr 28.83891
774    87.3  182.9 80.91258     pred   upr 99.12445
772    86.4  175.3 73.22669     pred   upr 91.40919

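Your example has no "upr" in "limit" and no "conf" in "interval", which means your desired result cannot be reproduced. - mtoto
Why not keep the data in long format and aggregate? See the examples here using base R, dplyr, and data.table. - Jaap
I have already done it with dcast, but I want to use tidyr to learn something. @mtoto That was just the head of my dataset; I will edit the question to give you a random sample so you can reproduce it. - TheRimalaya
That summarised by interval and limit and gave me only two rows. - TheRimalaya
Your tidyr and dcast code give the same output on the example data. - mtoto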
2 Answers

13

Suppose you start with data that looks like this:

mydf
#   Weight Height  Fitted interval limit    value
# 1     42  153.4 51.0792     conf   lwr 49.15463
# 2     42  153.4 51.0792     pred   lwr 32.82122
# 3     42  153.4 51.0792     conf   upr 53.00376
# 4     42  153.4 51.0792     pred   upr 69.33717
# 5     42  153.4 51.0792     conf   lwr 60.00000
# 6     42  153.4 51.0792     pred   lwr 90.00000

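Note that rows 5 and 6 repeat the combination of the identifier columns (columns 1 to 5). This is essentially what "tidyr" is telling you: row 1 and row 5 are duplicates, and so are rows 2 and 6.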

tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)
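
To see which identifier combinations trigger the error, one option is to count rows per combination before reshaping. This is only a sketch: it assumes dplyr is installed, and it rebuilds the six rows of mydf with data.frame() so that it runs on its own. The name dupes is just an illustrative choice.

```r
library(dplyr)

# Same six rows as mydf, rebuilt here so the snippet is self-contained
mydf <- data.frame(
  Weight = 42, Height = 153.4, Fitted = 51.0792,
  interval = c("conf", "pred", "conf", "pred", "conf", "pred"),
  limit = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Count rows per identifier combination; n > 1 marks a clash
dupes <- mydf %>%
  count(Weight, Height, Fitted, interval, limit) %>%
  filter(n > 1)
dupes
```

Two combinations (conf/lwr and pred/lwr) each appear twice, which is why aggregating the six rows leaves four.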

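As @Jaap suggested, the solution is to "summarise" the data first. Because "tidyr" only reshapes data (unlike "reshape2", which aggregates and reshapes in one step), you need to do the aggregation with "dplyr" before using "tidyr" to change the shape of the data. Here, I use summarise to aggregate the "value" column.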

If you stop at the summarise step, you will see that the original 6-row dataset has "shrunk" to 4 rows. Now spread works as expected.

library(dplyr)
library(tidyr)

mydf %>% 
  group_by(Weight, Height, Fitted, interval, limit) %>% 
  summarise(value = mean(value)) %>%   # one row per identifier combination
  spread(limit, value)
# Source: local data frame [2 x 6]
# 
#   Weight Height  Fitted interval      lwr      upr
#    (dbl)  (dbl)   (dbl)    (chr)    (dbl)    (dbl)
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717

This matches the expected output from dcast with fun.aggregate = mean.

reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, fun.aggregate = mean)
#   Weight Height  Fitted interval      lwr      upr
# 1     42  153.4 51.0792     conf 54.57731 53.00376
# 2     42  153.4 51.0792     pred 61.41061 69.33717
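
For readers on a newer tidyr, the same aggregate-and-reshape can probably be done in a single call with pivot_wider, which supersedes spread. This is a sketch rather than part of the original answer: it assumes tidyr 1.1.0 or later (where, to my understanding, values_fn accepts a single function), and it rebuilds mydf so that it runs on its own. The name wide is illustrative.

```r
library(tidyr)

# Same six rows as mydf, rebuilt here so the snippet is self-contained
mydf <- data.frame(
  Weight = 42, Height = 153.4, Fitted = 51.0792,
  interval = c("conf", "pred", "conf", "pred", "conf", "pred"),
  limit = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# values_fn collapses cells that receive more than one value
wide <- pivot_wider(
  mydf,
  names_from = limit,
  values_from = value,
  values_fn = mean
)
wide
```

Here values_fn plays the role that fun.aggregate plays in dcast, so no separate group_by/summarise step is needed.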

Sample data:

 mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4, 
     153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,         
     51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",        
     "conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,             
     2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),            
         value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,          
         90)), .Names = c("Weight", "Height", "Fitted", "interval",     
     "limit", "value"), row.names = c(NA, 6L), class = "data.frame")   

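Thanks! I was wondering how to handle the aggregation function. I think Hadley wants tidyr and dplyr to be used together. - TheRimalaya
This is a great answer; it made the difference between dcast and spread clear to me. Thanks! - Mikko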

1
Here is an alternative that uses data.table instead of dplyr, with mydf from the answer above:

library(data.table)
library(magrittr)
library(tidyr)

DT <- data.table(mydf)

First, you can use by to compute the mean for each limit.

DT[, .(lwr = mean(value[limit == "lwr"]), 
       upr = mean(value[limit == "upr"])), 
   by = .(Weight, Height, Fitted, interval)]

If limit == ... looks too hard-coded, you can aggregate in long format first and then spread. This works because, once aggregated, there are no duplicates left.

DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
  spread(key = "limit", value = "value")

Both give you

#   Weight Height  Fitted interval      lwr      upr
#1:     42  153.4 51.0792     conf 54.57731 53.00376
#2:     42  153.4 51.0792     pred 61.41061 69.33717
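
A third route, not part of the answer above, is data.table's own dcast method, which aggregates and reshapes in one call much like reshape2::dcast. The sketch below assumes only that data.table is installed; it rebuilds the table with data.table() so that it runs on its own, and the name wide is illustrative.

```r
library(data.table)

# Same six rows as mydf, built directly as a data.table
DT <- data.table(
  Weight = 42, Height = 153.4, Fitted = 51.0792,
  interval = c("conf", "pred", "conf", "pred", "conf", "pred"),
  limit = c("lwr", "lwr", "upr", "upr", "lwr", "lwr"),
  value = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# fun.aggregate resolves the duplicated identifier combinations
wide <- dcast(
  DT,
  Weight + Height + Fitted + interval ~ limit,
  value.var = "value",
  fun.aggregate = mean
)
wide
```

This avoids both the hard-coded limit == ... filters and the dependency on tidyr.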

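Thanks, but I was actually asking about dplyr and tidyr. I had already solved the problem with reshape2; I wanted to know how to solve it with these specific packages. Thanks anyway! - TheRimalaya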
