How to organize data in a summary format using R

Question

I have created the following data frame in R.

My_DF

ID        Date                  Type       Remark      Price
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1200
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_O      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_A      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_B      1600
PRT-11    2020-12-01 10:12:14   SS_RG      AT_1_C      1000
PRT-11    2020-12-01 10:12:14   SS_RT      AT_1_Y      1200
PRT-11    2020-12-07 10:12:14   SS_RT      AT_1_U      1600
PRT-11    2020-12-07 10:12:14   SS_RI      AT_1_M      1600
PRT-11    2020-12-07 10:12:14   SS_RO      AT_1_P      1600

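I want to convert the above DF into a data frame with the following structure, and then convert it to HTML so that I can send it by email using the mailR library.

I followed these conditions:

Type-A: Type equals SS_RT.
Type-B: Type is not SS_RT.
Type-A1: Type equals SS_RT and Remark equals AT_1_O.
Type-A2: Type equals SS_RT and Remark is not AT_1_O.
Type-A1 (exclusion): Type-A1 divided by the sum of Type-A1 and Type-A2.
Type-A2 (exclusion): Type-A2 divided by the sum of Type-A1 and Type-A2.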

The remaining percentage formulas are straightforward and use Total as the denominator.
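As an illustration, the rules above can be sketched in base R on the seven rows dated 2020-12-01. The variable names (`is_a`, `type_a1_excl`, and so on) are assumptions made for this sketch and do not appear in the question.

```r
# Rows for 2020-12-01 from My_DF (only the columns the rules need)
df <- data.frame(
  Type   = c("SS_RT", "SS_RT", "SS_RT", "SS_RG", "SS_RG", "SS_RG", "SS_RT"),
  Remark = c("AT_1_O", "AT_1_O", "AT_1_O", "AT_1_A", "AT_1_B", "AT_1_C", "AT_1_Y"),
  Price  = c(1000, 1200, 1600, 1600, 1600, 1000, 1200),
  stringsAsFactors = FALSE
)

# Logical masks for each rule (hypothetical names)
is_a  <- df$Type == "SS_RT"
is_a1 <- is_a & df$Remark == "AT_1_O"
is_a2 <- is_a & df$Remark != "AT_1_O"

# Counts per category
type_a  <- sum(is_a)    # 4
type_b  <- sum(!is_a)   # 3
type_a1 <- sum(is_a1)   # 3
type_a2 <- sum(is_a2)   # 1
total   <- nrow(df)     # 7

# Exclusion ratios use Type-A1 + Type-A2 as the denominator
type_a1_excl <- type_a1 / (type_a1 + type_a2)   # 0.75
type_a2_excl <- type_a2 / (type_a1 + type_a2)   # 0.25

# All other percentages use Total as the denominator
type_a_pct <- type_a / total

# The same masks give the price sums
type_a_price <- sum(df$Price[is_a])   # 5000
```

The price-based versions of the ratios follow the same pattern, with `sum(df$Price[mask])` in place of `sum(mask)`.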

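The data frame may have no entries for some dates. To handle this, take the minimum and maximum of the available dates, and for every date in that range that has no entries, show 0 in both the count and the sum columns.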
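A minimal base R sketch of that gap-filling step, assuming the per-date totals have already been computed. The `daily` table and its column names are hypothetical; its values are the totals for the two dates in the sample data.

```r
# Per-date totals for the two dates that have entries
daily <- data.frame(
  Date  = as.Date(c("2020-12-01", "2020-12-07")),
  Count = c(7, 3),
  Sum   = c(9200, 4800)
)

# Every calendar day from the minimum to the maximum date
all_days <- data.frame(Date = seq(min(daily$Date), max(daily$Date), by = "day"))

# Left join keeps every day; days without entries come back as NA
filled <- merge(all_days, daily, by = "Date", all.x = TRUE)

# Replace the NA values with 0 in the count and sum columns
filled$Count[is.na(filled$Count)] <- 0
filled$Sum[is.na(filled$Sum)]     <- 0
```

`filled` then has one row per day from December 1 to December 7, with zeros on the five days that have no entries.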

I show each date as two rows: the first row holds the counts and the second row holds the sums, grouped according to the logic defined above.

- Viper

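Hi. Should the first total price be 4000? My calculation gives 5000 (i.e., 1000 + 1200 + 1600 + 1200). Also, your expected output includes dates for each month, but the provided dataset spans only 6 days. I think there are already answers about completing a dataset, but you should provide details: what is the date sequence? - Cole

@Cole: Your calculation is correct; it should be 5000. Also, the date sequence should run between the minimum and maximum dates, in descending order. If the minimum date in the dataset is December 1 and the maximum is December 10, and there are no entries for the 2nd, 4th and 6th, we need to create a blank entry for each of them. - Viper

Thanks for clarifying. I don't understand what you mean by a blank entry. Why wouldn't December 8 be created? - Cole

@Cole: It is so that users can see that a particular date has no entries and can check accordingly. - Viper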

2 Answers

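This is mostly busy work. Basically, it computes the summaries you described with base R expressions, grouped by date in data.table. Note that I'm not sure whether the prices in the OP are accurate or whether something else is off.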

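library(data.table)

# dt is built in the data section below
dt[, Date := as.POSIXct(Date, tz = "UTC")]
dt[,
   {
     # counts of Type-A (SS_RT) and Type-B (everything else)
     t_ss_rt = Type == 'SS_RT'
     Type_A = sum(t_ss_rt)
     Type_B = .N - Type_A
     
     # price totals for the same split
     tot_Price = sum(Price)
     Type_A_price = sum(Price[t_ss_rt])
     Type_B_price = tot_Price - Type_A_price
     
     # Type-A1 (Remark is AT_1_O) and Type-A2 (any other Remark)
     rm_ss_rt = t_ss_rt & Remark == 'AT_1_O'
     Type_A1 = sum(rm_ss_rt)
     Type_A2 = Type_A - Type_A1
     
     tot_An_Price = sum(Price[t_ss_rt])
     Type_A1_Price = sum(Price[rm_ss_rt])
     Type_A2_Price = tot_An_Price - Type_A1_Price
     
     # exclusion ratios: denominator is A1 + A2
     Type_A1_Excl = Type_A1 / (Type_A1 + Type_A2)
     Type_A2_Excl = Type_A2 / (Type_A1 + Type_A2)
     
     # each column has two rows per date: count first, price second
     # V1/V2 = A and its share, V3/V4 = A1, V5/V6 = A2, V7/V8 = B,
     # V9 = total, V10/V11 = A1 and A2 exclusion ratios
     # (the price share of Type-B is left as NA here)
     .(c(Type_A, Type_A_price), c(Type_A / .N, Type_A_price / tot_Price),
       c(Type_A1, Type_A1_Price), c(Type_A1 / .N, Type_A1_Price / tot_Price),
       c(Type_A2, Type_A2_Price), c(Type_A2 / .N, Type_A2_Price / tot_Price),
       c(Type_B, Type_B_price), c(Type_B / .N, NA_real_), c(.N, tot_Price),
       c(Type_A1_Excl, Type_A1_Price / (tot_An_Price)), c(Type_A2_Excl, Type_A2_Price / tot_An_Price))
   },
   by = .(Date)]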

This gives the following result:

                  Date   V1        V2   V3        V4   V5        V6   V7        V8   V9  V10  V11
1: 2020-12-01 10:12:14    4 0.5714286    3 0.4285714    1 0.1428571    3 0.4285714    7 0.75 0.25
2: 2020-12-01 10:12:14 5000 0.5434783 3800 0.4130435 1200 0.1304348 4200        NA 9200 0.76 0.24
3: 2020-12-07 10:12:14    1 0.3333333    0 0.0000000    1 0.3333333    2 0.6666667    3 0.00 1.00
4: 2020-12-07 10:12:14 1600 0.3333333    0 0.0000000 1600 0.3333333 3200        NA 4800 0.00 1.00
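The columns come out as V1 to V11 because the list returned in `j` is unnamed. A small sketch of naming the list elements so the result gets readable headers; the toy table and the names `Total` and `Measure` are assumptions for illustration only.

```r
library(data.table)

# Toy input: two groups with a Price column
toy <- data.table(g = c("x", "x", "y"), Price = c(1000, 1200, 1600))

# Named list elements in j become column names;
# each group yields two rows (count first, sum second)
res <- toy[, .(Total   = c(.N, sum(Price)),
               Measure = c("count", "sum")),
           by = g]
```

The same idea would apply to the eleven columns above, for example `Type_A = c(Type_A, Type_A_price)`.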

Data:

library(data.table)

dt = data.table::fread(
"ID   ,     Date      ,            Type ,      Remark,      Price
PRT-11,    2020-12-01 10:12:14,   SS_RT,      AT_1_O   ,   1000
PRT-11,    2020-12-01 10:12:14,   SS_RT ,     AT_1_O  ,    1200
PRT-11,    2020-12-01 10:12:14,   SS_RT  ,    AT_1_O ,     1600
PRT-11,    2020-12-01 10:12:14,   SS_RG   ,   AT_1_A,      1600
PRT-11,    2020-12-01 10:12:14,   SS_RG    ,  AT_1_B     , 1600
PRT-11,    2020-12-01 10:12:14,   SS_RG     , AT_1_C    ,  1000
PRT-11,    2020-12-01 10:12:14,   SS_RT,      AT_1_Y   ,   1200
PRT-11,    2020-12-07 10:12:14,   SS_RT ,     AT_1_U  ,    1600
PRT-11,    2020-12-07 10:12:14,   SS_RI  ,    AT_1_M ,     1600
PRT-11,    2020-12-07 10:12:14,   SS_RO   ,   AT_1_P,      1600")

- Cole

- denis · Accepted Answer

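Here is a data.table solution. I tried to avoid computing everything by hand and instead came up with an approach based on reshaping from long to wide. The full code comes first, followed by a step-by-step explanation: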

library(lubridate)
library(data.table)

setDT(dt)
## keep the calendar date only and label each row A1, A2 or B
dt[, Date := date(Date)]
dt[, type := fifelse(Type == "SS_RT", fifelse(Remark == "AT_1_O", "A1", "A2"), "B")]
## transform to wide: counts (dcast defaults to length) and sums
df2 <- rbind(dcast(data = dt, Date ~ type, value.var = "Price", fill = 0)[, linetype := "count"],
             dcast(data = dt, Date ~ type, value.var = "Price", fill = 0, fun.aggregate = sum)[, linetype := "value"])
## A and tot
df2[, tot := rowSums(.SD), .SDcols = c("A1", "A2", "B")]
df2[, A := A1 + A2]
## create pc
cols <- c("A", "A1", "A2", "B")
df2[, paste0(cols, "_pc") := lapply(.SD, function(x) round(x / tot * 100)), .SDcols = cols]
cols <- c("A1", "A2")
df2[, paste0(cols, "_exc") := lapply(.SD, function(x) round(x / (A1 + A2) * 100)), .SDcols = cols]
## add missing dates
df2 <- merge(CJ(Date = seq(min(dt$Date), max(dt$Date), 1), linetype = c("count", "value")),
             df2, all = TRUE, by = c("Date", "linetype"))

## NA (and NaN from 0/0) become 0
df2[is.na(df2)] <- 0
df2[, linetype := NULL]
df2

          Date   A1   A2    B  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
 1: 2020-12-01    3    1    3    7    4   57    43    14   43     75     25
 2: 2020-12-01 3800 1200 4200 9200 5000   54    41    13   46     76     24
 3: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 4: 2020-12-02    0    0    0    0    0    0     0     0    0      0      0
 5: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 6: 2020-12-03    0    0    0    0    0    0     0     0    0      0      0
 7: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 8: 2020-12-04    0    0    0    0    0    0     0     0    0      0      0
 9: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
10: 2020-12-05    0    0    0    0    0    0     0     0    0      0      0
11: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
12: 2020-12-06    0    0    0    0    0    0     0     0    0      0      0
13: 2020-12-07    0    1    2    3    1   33     0    33   67      0    100
14: 2020-12-07    0 1600 3200 4800 1600   33     0    33   67      0    100

First, I create the type variable according to your rules:

dt[,Date := date(Date)]
dt[,type := fifelse(Type == "SS_RT",fifelse(Remark == "AT_1_O","A1","A2"),"B")]

We know that A is simply A1 plus A2, which lets me reshape the table to wide format. I need two casts: one that counts rows and one that sums Price by type.

dcast(data = dt,Date ~ type ,value.var = "Price",fill = 0)

         Date A1 A2 B 
1: 2020-12-01  3  1 3    
2: 2020-12-07  0  1 2

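Here I count the occurrences of each type, because dcast falls back to its default aggregation function, length. If I use sum as the aggregation function instead: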

dcast(data = dt,Date~type ,value.var = "Price",fill = 0,fun.aggregate = sum)

         Date   A1   A2    B
1: 2020-12-01 3800 1200 4200
2: 2020-12-07    0 1600 3200

I add a linetype variable to each table, which helps me add the missing dates later (I use it to keep two rows per date).

Binding the two tables gives:

         Date   A1   A2    B linetype
1: 2020-12-01    3    1    3    count
2: 2020-12-07    0    1    2    count
3: 2020-12-01 3800 1200 4200    value
4: 2020-12-07    0 1600 3200    value

I then compute A and the total:

df2[,tot := rowSums(.SD),.SDcols = c("A1","A2","B")]
df2[,A := A1+A2]

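Next I compute the percentages (_pc) and the exclusion variables (_exc) using lapply over a vector of the columns to transform. Note that the code shown here has no explicit fifelse guard against division by zero: a zero denominator yields NaN, which the final is.na() replacement turns into 0.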

cols <- c("A","A1","A2","B")
df2[,paste0(cols,"_pc") := lapply(.SD,function(x) round(x/tot*100) ),.SDcols = cols]
cols <- c("A1","A2")
df2[,paste0(cols,"_exc") := lapply(.SD,function(x) round(x/(A1+A2)*100) ),.SDcols = cols]


         Date   A1   A2    B linetype  tot    A A_pc A1_pc A2_pc B_pc A1_exc A2_exc
1: 2020-12-01    3    1    3    count    7    4   57    43    14   43     75     25
2: 2020-12-01 3800 1200 4200    value 9200 5000   54    41    13   46     76     24
3: 2020-12-07    0    1    2    count    3    1   33     0    33   67      0    100
4: 2020-12-07    0 1600 3200    value 4800 1600   33     0    33   67      0    100

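Then I add the missing dates by merging with every combination of Date and linetype and keeping all rows. I use the CJ function to create a data.table containing all combinations of the two variables: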

CJ(Date = seq(min(dt$Date),max(dt$Date),1),linetype = c("count","value"))
          Date linetype
 1: 2020-12-01    count
 2: 2020-12-01    value
 3: 2020-12-02    count
 4: 2020-12-02    value
 5: 2020-12-03    count
 6: 2020-12-03    value
 7: 2020-12-04    count
 8: 2020-12-04    value
 9: 2020-12-05    count
10: 2020-12-05    value
11: 2020-12-06    count
12: 2020-12-06    value
13: 2020-12-07    count
14: 2020-12-07    value

Finally, I replace the missing values with 0 and drop the linetype variable.

From there you can reorder the columns with setcolorder and produce the HTML output with kableExtra.
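A minimal sketch of that last step, assuming the knitr and kableExtra packages are installed. The `res` table holds two rows copied from the result above with a subset of its columns, and the chosen column order is only an example.

```r
library(data.table)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Two rows copied from the result above (subset of columns)
res <- data.table(
  Date = as.Date(c("2020-12-01", "2020-12-01")),
  A1   = c(3, 3800),
  A2   = c(1, 1200),
  B    = c(3, 4200),
  tot  = c(7, 9200),
  A    = c(4, 5000)
)

# Reorder by reference: A first, then its parts, then B and the total
setcolorder(res, c("Date", "A", "A1", "A2", "B", "tot"))

# Render as an HTML table and collapse it into a single string
html_table <- kable_styling(kable(res, format = "html"))
html_body  <- paste(as.character(html_table), collapse = "\n")
```

A string like `html_body` could then be passed as the `body` argument of `mailR::send.mail()` with `html = TRUE`; the sender, recipients and SMTP settings depend on your environment and are not shown here.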

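You could do the same with dplyr: use pivot_wider to reshape to wide, mutate_all in place of lapply(.SD, ...) for the calculations, and expand.grid in place of CJ to build the table of missing dates.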
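A rough sketch of the dplyr/tidyr route, covering only the "value" (sum) rows; the count rows would follow the same pattern with `length` as the aggregation. It assumes recent versions of dplyr and tidyr and uses `across()` where the text mentions `mutate_all`. The input `df` is the sample data with the type label already assigned.

```r
library(dplyr)
library(tidyr)

# Sample data reduced to Date, the A1/A2/B label, and Price
df <- data.frame(
  Date  = as.Date(c(rep("2020-12-01", 7), rep("2020-12-07", 3))),
  type  = c("A1", "A1", "A1", "B", "B", "B", "A2", "A2", "B", "B"),
  Price = c(1000, 1200, 1600, 1600, 1600, 1000, 1200, 1600, 1600, 1600)
)

# Wide table of price sums per date; absent date/type pairs become 0
value_wide <- df %>%
  pivot_wider(names_from = type, values_from = Price,
              values_fn = sum, values_fill = 0) %>%
  mutate(tot = A1 + A2 + B, A = A1 + A2) %>%
  mutate(across(c(A, A1, A2, B), ~ round(.x / tot * 100), .names = "{.col}_pc"))

# Every day between the minimum and maximum date
all_days <- expand.grid(Date = seq(min(df$Date), max(df$Date), by = "day"))

# Join the sums back and fill the days without entries with 0
out <- all_days %>%
  left_join(value_wide, by = "Date") %>%
  mutate(across(-Date, ~ replace_na(.x, 0)))
```

To reproduce the two-rows-per-date layout, `expand.grid` would also take a `linetype = c("count", "value")` argument and the join would use both columns as keys.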