Summarizing a data frame by multiple variables

4

I have the data frame shown below:

ID        Date            Status         Category
TR-1      2018-01-10      Passed         A
TR-2      2018-01-09      Passed         B
TR-3      2018-01-09      Failed         C
TR-3      2018-01-09      Failed         A
TR-4      2018-01-08      Failed         B
TR-5      2018-01-08      Passed         C
TR-5      2018-01-08      Failed         A
TR-6      2018-01-07      Passed         A

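From the data frame above, I want output in the following format:
Dates should be sorted in descending order, and the category order should be C, A and B.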
Date         count      distinct_count      Passed     Failed
2018-01-10   1          1                   1          0
    A        1          1                   1          0
    B        0          0                   0          0
    C        0          0                   0          0
2018-01-09   3          2                   1          2
    A        1          1                   1          0
    B        1          1                   1          0
    C        1          1                   1          0

To get the output above I tried the following code, but it does not work and does not produce the desired output.
Output<-DF %>%
  group_by(Date=Date,A,B,C) %>%
  summarise(`Count`  = n(),
            `Distinct_count` = n_distinct(ID),
            Passed=sum(Status=='Passed'),
            A=count(category='A'),
            B=count(category='B'),
            C=count(category='C'),
            Failed=sum(Status=='Failed'))

Dput:

structure(list(ID = structure(c(1L, 2L, 3L, 3L, 4L, 5L, 5L, 6L
), .Label = c("TR-1", "TR-2", "TR-3", "TR-4", "TR-5", "TR-6"), class = "factor"), 
    Date = structure(c(4L, 3L, 3L, 3L, 2L, 2L, 2L, 1L), .Label = c("07/01/2018", 
    "08/01/2018", "09/01/2018", "10/01/2018"), class = "factor"), 
    Status = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("Failed", 
    "Passed"), class = "factor"), Category = structure(c(1L, 
    2L, 3L, 1L, 2L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("ID", 
"Date", "Status", "Category"), class = "data.frame", row.names = c(NA, 
-8L))

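A dput() output of the initial data frame would help in reproducing the problem. - runr
The output you propose is messy, and I think it will make later work difficult. Are you sure you don't want to replace your second line with group_by(Date, Category) %>% to generate the output? You could then summarise a second table grouped by Date to get the per-date counts, and use left_join to join it to the first table, giving you an extra column with the per-date counts... - NRLP
@Luminita: Are you suggesting creating two separate outputs, grouped by date and by category, and then doing a left_join? - Vector JX
Now that I look at your summarise more closely, I don't think it will work the way you intend. Please add the output of your initial data as Nutle suggested. - NRLP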
1
How does ID come into play? - Andre Elrico
5 Answers

6

This is a tricky one:

# I'm converting some variables to factors to get the "order" right and to fill in missing unobserved values later in dcast.
library(magrittr)  # provides the %>% pipe used below
df1$Category <- factor(df1$Category, levels = unique(df1$Category))
date_lvls    <- as.Date(df1$Date, "%Y-%m-%d") %>% unique %>% sort(decreasing = TRUE) %>% as.character
df1$Date     <- factor(df1$Date, date_lvls)

# lets use data.table
library(data.table)
setDT(df1)

# make a lookup table to deal with the duplicated ID issue. Not sure how to do this elegantly
tmp <- dcast.data.table(df1, Date ~ ID, fun.aggregate = length)
tmp <- structure(rowSums(tmp[,-1] == 2), .Names = as.character(unlist(tmp[, 1])))

# precaution! Boilerplate incoming in 3, 2, .. 1
dcast.data.table(df1, Date + Category ~ Status, drop = FALSE)[
    ,`:=`(Failed=+!is.na(Failed), Passed=+!is.na(Passed))][
    , c("count","distinct_count") := rowSums(cbind(Failed,Passed))][
    , Category := as.character(Category)][
    , rbind(
        cbind(Category = as.character(Date[1]), count = sum(count), distinct_count = sum(distinct_count) - tmp[as.character(Date[1])], Passed = sum(Passed), Failed = sum(Failed)),
        .SD
       , fill = TRUE), by = Date][
    , Date := NULL ][]

result:

 #     Category count distinct_count Passed Failed
 #1: 2018-01-10     1              1      1      0
 #2:          A     1              1      1      0
 #3:          B     0              0      0      0
 #4:          C     0              0      0      0
 #5: 2018-01-09     3              2      1      2
 #6:          A     1              1      0      1
 #7:          B     1              1      1      0
 #8:          C     1              1      0      1
 #9: 2018-01-08     3              2      1      2
#10:          A     1              1      0      1
#11:          B     1              1      0      1
#12:          C     1              1      1      0
#13: 2018-01-07     1              1      1      0
#14:          A     1              1      1      0
#15:          B     0              0      0      0
#16:          C     0              0      0      0

Data:

df1<-
structure(list(ID = c("TR-1", "TR-2", "TR-3", "TR-3", "TR-4", 
"TR-5", "TR-5", "TR-6"), Date = c("2018-01-10", "2018-01-09", 
"2018-01-09", "2018-01-09", "2018-01-08", "2018-01-08", "2018-01-08", 
"2018-01-07"), Status = c("Passed", "Passed", "Failed", "Failed", 
"Failed", "Passed", "Failed", "Passed"), Category = c("A", "B", 
"C", "A", "B", "C", "A", "A")), row.names = c(NA, -8L), class = "data.frame")

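Please note:
  • Please run the code step by step. To do so, close each open bracket and run the chain up to that point, for example:

    1. Run: dcast.data.table(df1, Date + Category ~ Status, drop = FALSE)[]

    2. Run: dcast.data.table(df1, Date + Category ~ Status, drop = FALSE)[ ,`:=`(Failed=+!is.na(Failed), Passed=+!is.na(Passed))][]

    3. ... and so on to the end.

    4. If anything is unclear, please ask me about that specific step.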


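I get the error: Using 'Category' as value column. Use 'value.var' to override - Vector JX
That is not an error, it is a warning. You can explicitly set value.var = "Category". - Andre Elrico
Now I get the error "Aggregate function missing, defaulting to 'length'" :( - Vector JX
Again, that is not an error but a warning. You got this warning before as well. - Andre Elrico
Still getting the same error: Aggregate function missing, defaulting to 'length' - Vector JX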
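The two notices discussed in these comments come from dcast having to guess its value column and its aggregation function. A minimal sketch of naming both explicitly follows. It is not part of the answer above, and the choice of "ID" as value.var is an assumption made here (with length as the aggregate, any column would give the same counts).

```r
library(data.table)

# Same sample data as in the answer above.
df1 <- data.table(
  ID       = c("TR-1", "TR-2", "TR-3", "TR-3", "TR-4", "TR-5", "TR-5", "TR-6"),
  Date     = c("2018-01-10", "2018-01-09", "2018-01-09", "2018-01-09",
               "2018-01-08", "2018-01-08", "2018-01-08", "2018-01-07"),
  Status   = c("Passed", "Passed", "Failed", "Failed",
               "Failed", "Passed", "Failed", "Passed"),
  Category = c("A", "B", "C", "A", "B", "C", "A", "A")
)

# Naming value.var and fun.aggregate leaves nothing for dcast to guess.
# drop = FALSE asks dcast to keep combinations that do not occur in the data.
wide <- dcast(
  df1,
  Date + Category ~ Status,
  value.var     = "ID",        # assumption: count rows via the ID column
  fun.aggregate = length,
  drop          = FALSE
)
```

Each cell of wide then holds the number of rows with that Date, Category and Status.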

4

I'm sure there must be a more elegant solution, but with tidyverse you can do this:

bind_rows(df %>%
           arrange(Date) %>%
           group_by(Date, Category) %>%
           summarise(count = n(),
                     distinct_count = n_distinct(ID),
                     passed = length(Status[Status == "Passed"]),
                     failed = length(Status[Status == "Failed"])) %>% 
           complete(Category) %>% 
           mutate_all(funs(coalesce(., 0L))) %>%
           ungroup() %>%
           mutate(Date = Category,
                  date_id = gl(nrow(.)/3, 3)) %>%
           select(-Category), df %>%
           arrange(Date) %>%
           group_by(Date) %>%
           summarise(count = n(),
                     distinct_count = n_distinct(ID),
                     passed = length(Status[Status == "Passed"]),
                     failed = length(Status[Status == "Failed"])) %>%
           mutate(date_id = gl(nrow(.), 1))) %>%
 arrange(date_id, Date)

   Date       count distinct_count passed failed date_id
   <chr>      <int>          <int>  <int>  <int> <fct>  
 1 07/01/2018     1              1      1      0 1      
 2 A              1              1      1      0 1      
 3 B              0              0      0      0 1      
 4 C              0              0      0      0 1      
 5 08/01/2018     3              2      1      2 2      
 6 A              1              1      0      1 2      
 7 B              1              1      0      1 2      
 8 C              1              1      1      0 2      
 9 09/01/2018     3              2      1      2 3      
10 A              1              1      0      1 3      
11 B              1              1      1      0 3      
12 C              1              1      0      1 3      
13 10/01/2018     1              1      1      0 4      
14 A              1              1      1      0 4      
15 B              0              0      0      0 4      
16 C              0              0      0      0 4 

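First, it creates a df with count, distinct_count, passed and failed columns based on "Date" and "Category". Second, using complete(), it generates all levels of "Category" and then uses coalesce() to fill the levels that do not exist with 0. Third, it creates a second df with count, distinct_count, passed and failed columns based on "Date" only. Finally, it combines the two dfs by rows.
Sample data: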
df <- read.table(text = "ID        Date            Status         Category
TR-1      2018-01-10      Passed         A
                 TR-2      2018-01-09      Passed         B
                 TR-3      2018-01-09      Failed         C
                 TR-3      2018-01-09      Failed         A
                 TR-4      2018-01-08      Failed         B
                 TR-5      2018-01-08      Passed         C
                 TR-5      2018-01-08      Failed         A
                 TR-6      2018-01-07      Passed         A", header = TRUE)
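The four steps described above can also be sketched without funs(), which has been deprecated in dplyr. This is only an illustration, not the answer author's code: it assumes dplyr 1.0 or later (for the .groups argument) and tidyr, and the names per_cat, per_date, label and is_total are made up here. Categories come out in alphabetical order, not the C, A, B order mentioned in the question.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID       = c("TR-1", "TR-2", "TR-3", "TR-3", "TR-4", "TR-5", "TR-5", "TR-6"),
  Date     = c("2018-01-10", "2018-01-09", "2018-01-09", "2018-01-09",
               "2018-01-08", "2018-01-08", "2018-01-08", "2018-01-07"),
  Status   = c("Passed", "Passed", "Failed", "Failed",
               "Failed", "Passed", "Failed", "Passed"),
  Category = c("A", "B", "C", "A", "B", "C", "A", "A"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: per Date/Category summary, then fill unobserved pairs with 0.
per_cat <- df %>%
  group_by(Date, Category) %>%
  summarise(count = n(),
            distinct_count = n_distinct(ID),
            passed = sum(Status == "Passed"),
            failed = sum(Status == "Failed"),
            .groups = "drop") %>%
  complete(Date, Category,
           fill = list(count = 0L, distinct_count = 0L, passed = 0L, failed = 0L))

# Step 3: per Date summary.
per_date <- df %>%
  group_by(Date) %>%
  summarise(count = n(),
            distinct_count = n_distinct(ID),
            passed = sum(Status == "Passed"),
            failed = sum(Status == "Failed"),
            .groups = "drop")

# Step 4: stack both tables, newest date first, date row above its categories.
combined <- bind_rows(
  per_date %>% mutate(label = Date, is_total = TRUE),
  per_cat %>% mutate(label = Category, is_total = FALSE)
) %>%
  arrange(desc(Date), desc(is_total), label) %>%
  select(label, count, distinct_count, passed, failed)
```

Sorting on the character Date column works here only because the dates are in ISO year-month-day form; dates stored as day/month/year would need converting with as.Date() first.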

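@tmfnk: Thanks, but I want categories A, B and C to stay the same for every date, whether or not the category has an entry on that date; if the category is not available, the values should be 0. - Vector JX
Although all columns are of type character, I get an error like "Error in eval(substitute(expr), envir, enclos) : Argument 2 must be type character, not integer". - Vector JX
I added sample data to my post. Does it work with that? - tmfmnk
Yes, it works on the sample data. But I get an error on this line: dplyr::mutate_all(funs(coalesce(., 0L))) - Vector JX
What about mutate_at(3:6, funs(ifelse(is.na(.), 0, .)))? - tmfmnk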

4
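Mixing variables such as $Date and $Category in the same column is a bad idea because, as @Luminata pointed out, it makes further processing of the data very difficult.
Your goal is not very clear, so any answer has to be tentative, but here is a solution that may get you closer to it:
If this is your data: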
df <- data.frame(
  ID = c("TR-1","TR-2", "TR-3", "TR-3", "TR-4", "TR-5", "TR-5", "TR-6"),       
  Date = c("2018-01-10", "2018-01-09", "2018-01-09", "2018-01-09", "2018-01-08", "2018-01-08", "2018-01-08", "2018-01-07"),            
  Status = c("Passed","Passed","Failed","Failed","Failed","Passed","Failed", "Passed"),         
 Category = c("A","B","C","A","B","C","A","A")
)

If you want to split the data by $Date, why not use the by function to create a list of separate data frames, one per date:

df_list <- by(df, df$Date, function(unique) unique)
df_list
df$Date: 2018-01-07
    ID       Date Status Category
8 TR-6 2018-01-07 Passed        A
------------------------------------------------------------------------------------------ 
df$Date: 2018-01-08
    ID       Date Status Category
5 TR-4 2018-01-08 Failed        B
6 TR-5 2018-01-08 Passed        C
7 TR-5 2018-01-08 Failed        A
------------------------------------------------------------------------------------------ 
df$Date: 2018-01-09
    ID       Date Status Category
2 TR-2 2018-01-09 Passed        B
3 TR-3 2018-01-09 Failed        C
4 TR-3 2018-01-09 Failed        A
------------------------------------------------------------------------------------------ 
df$Date: 2018-01-10
    ID       Date Status Category
1 TR-1 2018-01-10 Passed        A
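Once the data is split by date, the per-date figures from the question can be computed from each piece. A minimal base R sketch follows; it uses split(), which returns a plain named list, in place of by(), and the name per_date is made up here. It covers only the date-level rows, not the per-category rows.

```r
df <- data.frame(
  ID       = c("TR-1", "TR-2", "TR-3", "TR-3", "TR-4", "TR-5", "TR-5", "TR-6"),
  Date     = c("2018-01-10", "2018-01-09", "2018-01-09", "2018-01-09",
               "2018-01-08", "2018-01-08", "2018-01-08", "2018-01-07"),
  Status   = c("Passed", "Passed", "Failed", "Failed",
               "Failed", "Passed", "Failed", "Passed"),
  Category = c("A", "B", "C", "A", "B", "C", "A", "A"),
  stringsAsFactors = FALSE
)

# One data frame per date, in a named list.
df_list <- split(df, df$Date)

# Summarise each piece and stack the one-row results.
per_date <- do.call(rbind, lapply(names(df_list), function(d) {
  x <- df_list[[d]]
  data.frame(
    Date           = d,
    count          = nrow(x),
    distinct_count = length(unique(x$ID)),
    Passed         = sum(x$Status == "Passed"),
    Failed         = sum(x$Status == "Failed")
  )
}))

# Newest date first, as the question asks.
per_date <- per_date[order(per_date$Date, decreasing = TRUE), ]
```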

Are you suggesting by as an answer? - Andre Elrico
Yes, a tentative one. Why? - Chris Ruehlemann

2
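You can use lapply over the distinct levels of the two columns you want to group by, combined with do.call("rbind", x), to get the result back as an array.
Something like this:
```R
result <- do.call("rbind", lapply(split(df[c("col1", "col2")], df$group), function(x) {
  # your code here
}))
```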
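# Assumes Date, Category and Status are factors (the read.table default before R 4.0),
# since levels() is used to enumerate the groups.
res <- do.call("rbind", lapply(levels(DF$Date), function(d) {
  do.call("rbind", lapply(levels(DF$Category), function(c) {
    # Status counts for this Date/Category combination
    tbl <- table(DF$Status[DF$Category == c & DF$Date == d])
    cbind(Date = d, Category = c, count = sum(tbl), distinct_count = sum(tbl > 0), t(tbl))
  }))
}))
res <- as.data.frame(res)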

I added a few rows to the data set, so the input data frame is:
DF <- read.table(text =
"fD    Date    Status    Category
TR-1    2018-01-10    Passed    A
TR-2    2018-01-09    Passed    B
TR-3    2018-01-09    Failed    C
TR-4    2018-01-09    Failed    A
TR-5    2018-01-08    Failed    B
TR-6    2018-01-08    Passed    C
TR-7    2018-01-08    Failed    A
TR-8    2018-01-08    Passed    B
TR-9    2018-01-08    Failed    A
TR-10    2018-01-08    Failed    A
TR-11    2018-01-07    Passed    A"
, header = TRUE)

The first piece of code will output:
> res
         Date Category count distinct_count Failed Passed
1  2018-01-07        A     1              1      0      1
2  2018-01-07        B     0              0      0      0
3  2018-01-07        C     0              0      0      0
4  2018-01-08        A     3              1      3      0
5  2018-01-08        B     2              2      1      1
6  2018-01-08        C     1              1      0      1
7  2018-01-09        A     1              1      1      0
8  2018-01-09        B     1              1      0      1
9  2018-01-09        C     1              1      1      0
10 2018-01-10        A     1              1      0      1
11 2018-01-10        B     0              0      0      0
12 2018-01-10        C     0              0      0      0

Edit: I think I finally figured out what you mean by "distinct count", so I updated my answer.


1
This answer seems to me the most concise, while also making use of efficient tools. +1 - kana

0

As others have pointed out, mixing your variables in one column may not be the best idea, but I simply combined the rows:

 library(tidyr)
 library(dplyr)
 Output <- DF %>%
   group_by(Date, Category) %>%
   summarise('Count'=n(),
             'Distinct_Count'=n_distinct(ID),
             Passed=sum(Status=='Passed'),
             Failed=sum(Status=='Failed')) %>%
   ungroup() %>%
   complete(Date, Category, fill=list(Count=0, Distinct_Count=0, Passed=0, Failed=0))


 perDay <- Output %>% 
   group_by(Date) %>%
   summarise('Count'=sum(Count),
             'Distinct_Count'=sum(Distinct_Count),
             Passed=sum(Passed),
             Failed=sum(Failed)) %>%
   arrange(desc(Date))

 Output$indate <- Output$Date
 Output$Date <- Output$Category
 Combined <- bind_rows(lapply(perDay$Date, function(date) {
   rbind(perDay[perDay$Date==date,], Output[Output$indate==date,c(1,3:6)])
 }))

The perDay data frame and the per-category counts in Output (completed where necessary) are only bound together afterwards, day by day.

