What is the R equivalent of SAS's PROC SUMMARY?

4
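I'm looking for some R code that produces the same output as PROC SUMMARY in SAS.
I have looked at this thread, which asks something similar: R: calculating column sums & row sums as an aggregation from a dataframe. My problem is slightly different, though, because I don't want to sum across columns, but across rows.
My data is structured like this: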
   Flag1   Flag2  Flag3   Type1 Type2 Type3
1     Level1    A    FIRST     2     0    0
2     Level1    A    SECOND    1     9    0
3     Level1    A    THIRD     3     7    0
4     Level1    A    FOURTH    9    18    0
5     Level1    A    FIFTH     1    22    0
6     Level1    A    SIXTH     1    13    0
7     Level1    B    FIRST     0     0    0
8     Level1    B    SECOND    3     9    0
9     Level1    B    THIRD     5    85    0
10    Level1    B    FOURTH    4    96    0
11    Level1    B    FIFTH     3    40    0
12    Level1    B    SIXTH     0    17    0
22    Level2    A    FIRST     2     0    0
23    Level2    A    SECOND    1     9    0
24    Level2    A    THIRD     3     7    0
25    Level2    A    FOURTH    9    18    0
26    Level2    A    FIFTH     1    22    0
27    Level2    A    SIXTH     1    13    0
28    Level2    B    FIRST     0     0    0
29    Level2    B    SECOND    3     9    0
30    Level2    B    THIRD     5    85    0
31    Level2    B    FOURTH    4    96    0
32    Level2    B    FIFTH     3    40    0
33    Level2    B    SIXTH     0    17    0
34    Level3    A    FIRST     2     0    0
35    Level3    A    SECOND    1     9    0
36    Level3    A    THIRD     3     7    0
37    Level3    A    FOURTH    9    18    0
38    Level3    A    FIFTH     1    22    0
39    Level3    A    SIXTH     1    13    0
40    Level3    B    FIRST     0     0    0
41    Level3    B    SECOND    3     9    0
42    Level3    B    THIRD     5    85    0
43    Level3    B    FOURTH    4    96    0
44    Level3    B    FIFTH     3    40    0
45    Level3    B    SIXTH     0    17    0

I would like the result to look like this, where Sum could be, for example, the sum of Type1.
    Flag1   Flag2   Flag3   Sum
1   Level1  A   FIRST   
2   Level1  A   SECOND  
3   Level1  A   THIRD   
4   Level1  A   FOURTH  
5   Level1  A   FIFTH   
6   Level1  A   SIXTH   
7   Level1  B   FIRST   
8   Level1  B   SECOND  
9   Level1  B   THIRD   
10  Level1  B   FOURTH  
11  Level1  B   FIFTH   
12  Level1  B   SIXTH  
13  Level1  (all)   FIRST   
14  Level1  (all)   SECOND  
15  Level1  (all)   THIRD   
16  Level1  (all)   FOURTH  
17  Level1  (all)   FIFTH   
18  Level1  (all)   SIXTH   
19  Level1  A   (all)   
20  Level1  B   (all)   
21  Level1  (all)   (all)  
22  Level2  A   FIRST   
23  Level2  A   SECOND  
24  Level2  A   THIRD   
25  Level2  A   FOURTH  
26  Level2  A   FIFTH   
27  Level2  A   SIXTH   
28  Level2  B   FIRST   
29  Level2  B   SECOND  
30  Level2  B   THIRD   
31  Level2  B   FOURTH  
32  Level2  B   FIFTH   
33  Level2  B   SIXTH   
34  Level2  (all)   FIRST 
35  Level2  (all)   SECOND
36  Level2  (all)   THIRD 
37  Level2  (all)   FOURTH
38  Level2  (all)   FIFTH 
39  Level2  (all)   SIXTH 
40  Level2  A   (all)   
41  Level2  B   (all)   
42  Level2  (all)   (all)   
43  Level3  A   FIRST   
44  Level3  A   SECOND  
45  Level3  A   THIRD   
46  Level3  A   FOURTH  
47  Level3  A   FIFTH   
48  Level3  A   SIXTH   
49  Level3  B   FIRST   
50  Level3  B   SECOND  
51  Level3  B   THIRD   
52  Level3  B   FOURTH  
53  Level3  B   FIFTH   
54  Level3  B   SIXTH   
55  Level3  (all)   FIRST   
56  Level3  (all)   SECOND  
57  Level3  (all)   THIRD   
58  Level3  (all)   FOURTH  
59  Level3  (all)   FIFTH   
60  Level3  (all)   SIXTH   
61  Level3  A   (all)   
62  Level3  B   (all)   
63  Level3  (all)   (all)   
64  (all)   A   FIRST   
65  (all)   A   SECOND  
66  (all)   A   THIRD   
67  (all)   A   FOURTH  
68  (all)   A   FIFTH   
69  (all)   A   SIXTH   
70  (all)   B   FIRST   
71  (all)   B   SECOND  
72  (all)   B   THIRD   
73  (all)   B   FOURTH  
74  (all)   B   FIFTH   
75  (all)   B   SIXTH   
76  (all)   A   (all)   
77  (all)   B   (all)   
78  (all)   (all)   (all)   

In SAS this is easily done with the following code:
PROC SUMMARY data=test;
class Flag1 Flag2 Flag3;
var Type1;
output out=final_data sum(Type1)=Sum;
run;

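This creates all the different subtotals at every level. (The nway option, by contrast, would output only the lowest level of the data.) I can't seem to find any R code that is this simple. Can you help me? The actual data has 7 dimensions, so a hard-coded approach is not an option. If anything above is unclear, please ask for details. Cheers.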

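Are the sum values missing from the example output? - StupidWolf
If I understand correctly, this produces per-group sums plus all of the marginal totals. I suspect this is either iterative or a reshape (which would allow the use of addmargins). - r2evans
I think you're asking this: https://dev59.com/_rTma4cB1Zd3GeqP96an - Reeza
Can you show what your expected output looks like? - Ronak Shah
Hi all. @r2evans You are partly right, as I also need all the different combinations of the 7 dimensions, not just the 7 marginal totals (which are a subset of what I'm looking for). Reeza, I'm not so sure, as it seems to create only the marginal totals. Though I could be very wrong? :) - Martin Hoffmann
Have you tried library(mosaic); favstats(x, data=dataset_name)? - Schilker
3 Answers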

1

OK, this is a great question...

I set up your data like this:

data "c:\blah\test.sas7bdat";
input  id  Flag1 $   Flag2 $ Flag3 $  Type1 Type2 Type3;
datalines;
1     Level1    A    FIRST     2     0    0
2     Level1    A    SECOND    1     9    0
3     Level1    A    THIRD     3     7    0
4     Level1    A    FOURTH    9    18    0
5     Level1    A    FIFTH     1    22    0
6     Level1    A    SIXTH     1    13    0
7     Level1    B    FIRST     0     0    0
8     Level1    B    SECOND    3     9    0
9     Level1    B    THIRD     5    85    0
10    Level1    B    FOURTH    4    96    0
11    Level1    B    FIFTH     3    40    0
12    Level1    B    SIXTH     0    17    0
22    Level2    A    FIRST     2     0    0
23    Level2    A    SECOND    1     9    0
24    Level2    A    THIRD     3     7    0
25    Level2    A    FOURTH    9    18    0
26    Level2    A    FIFTH     1    22    0
27    Level2    A    SIXTH     1    13    0
28    Level2    B    FIRST     0     0    0
29    Level2    B    SECOND    3     9    0
30    Level2    B    THIRD     5    85    0
31    Level2    B    FOURTH    4    96    0
32    Level2    B    FIFTH     3    40    0
33    Level2    B    SIXTH     0    17    0
34    Level3    A    FIRST     2     0    0
35    Level3    A    SECOND    1     9    0
36    Level3    A    THIRD     3     7    0
37    Level3    A    FOURTH    9    18    0
38    Level3    A    FIFTH     1    22    0
39    Level3    A    SIXTH     1    13    0
40    Level3    B    FIRST     0     0    0
41    Level3    B    SECOND    3     9    0
42    Level3    B    THIRD     5    85    0
43    Level3    B    FOURTH    4    96    0
44    Level3    B    FIFTH     3    40    0
45    Level3    B    SIXTH     0    17    0
;
run;

In SAS, I got the summary like this:
PROC SUMMARY data="c:\blah\test.sas7bdat";
class Flag1 Flag2 Flag3;
var Type1;
output out =final_data (drop = _type_ _freq_) Sum=sum ;
run;

I loaded the data into R with the haven package, then processed it with functions from the dplyr package.
library(haven)
library(dplyr)

# the read_sas() function is in the haven package
test <- read_sas("c:\\blah\\test.sas7bdat")

# This uses dplyr functions for the nway (lowest-level) summary
done <- test %>%
  group_by(Flag1, Flag2, Flag3) %>%
  summarise(Sum = sum(Type1)) %>%
  ungroup()


# This uses dplyr functions for the nested subgroups
# (Flag1; Flag1 x Flag2; Flag1 x Flag2 x Flag3), not every combination
done1 <- test %>%
  group_by(Flag1) %>%
  summarise(Sum = sum(Type1))
done2 <- test %>%
  group_by(Flag1, Flag2) %>%
  summarise(Sum = sum(Type1))
done3 <- test %>%
  group_by(Flag1, Flag2, Flag3) %>%
  summarise(Sum = sum(Type1))

all <- bind_rows(done3, done2, done1)

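The nice thing is that dplyr uses easy-to-understand verbs as function names. If you want to learn more, have a look at R for Data Science.
In theory, the code that makes the "done" objects could be put into a function and then run with the map function from the purrr package. I haven't worked out how to generate the list of all the variable combinations. There must be a function that does this, but I couldn't find it. This post gets you part of the way there: Yihui on combinations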
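# x holds the grouping variable names; n is how many there are
x <- c("Flag1", "Flag2", "Flag3")
n <- length(x)
stringsOfVariables <- capture.output(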
  for (i in 1:n) {
    if (i == 1) {
      for (j in 1:n) {
        cat(x[j], "\n")
      }
    }
    else {
      for (j in 1:(n - i + 1)) {
        for (k in (j + i - 1):n) {
          cat(c(x[j:(j + i - 2)], x[k]), "\n",  sep = ",")
        }
      }
    }
  }
)
stringsOfVariables
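The function the paragraph above is looking for may well be base R's combn(). The following is a minimal sketch, assuming the grouping variables are the three Flag columns from the question; the names x and subsets are only illustrative.

```r
# Grouping variable names (taken from the question's data)
x <- c("Flag1", "Flag2", "Flag3")

# All non-empty subsets, flattened into one list of character vectors
subsets <- unlist(
  lapply(seq_along(x), function(k) combn(x, k, simplify = FALSE)),
  recursive = FALSE
)

# Prepend the empty set, which corresponds to the grand total
subsets <- c(list(character(0)), subsets)

length(subsets)  # 2^3 grouping sets
vapply(subsets, paste, character(1), collapse = ",")
```

Each element of subsets can then be handed to a summarising function, one grouping set at a time. With 7 variables this gives 2^7 = 128 grouping sets.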

Here is a function that produces the summary for any set of variables.

library(rlang)
counts <- function(...) {
  vars <- enquos(...)  
  test %>%
    group_by(!!!vars) %>%
    summarise(sum(Type1))
}
done3 <- counts(Flag1,Flag2,Flag3)

Maybe someone else can add the missing tidyverse piece...
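One possible way to fill in that missing piece is sketched below. It is not the answer author's code: it assumes dplyr 1.0 or later, rebuilds only the Type1 column of the sample data so the example is self-contained, and uses a hypothetical helper sum_by(). Grouping sets come from combn(), and an NA left behind by bind_rows() is relabelled "(all)".

```r
library(dplyr)

# Rebuild the sample data (Type1 values copied from the question)
test <- expand.grid(
  Flag3 = c("FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH"),
  Flag2 = c("A", "B"),
  Flag1 = c("Level1", "Level2", "Level3"),
  stringsAsFactors = FALSE
)
test$Type1 <- rep(c(2, 1, 3, 9, 1, 1, 0, 3, 5, 4, 3, 0), times = 3)

flags <- c("Flag1", "Flag2", "Flag3")

# Every subset of the flags, including the empty set (grand total)
grouping_sets <- c(
  list(character(0)),
  unlist(lapply(seq_along(flags),
                function(k) combn(flags, k, simplify = FALSE)),
         recursive = FALSE)
)

# Hypothetical helper: sum Type1 within one grouping set
sum_by <- function(vars) {
  test %>%
    group_by(!!!rlang::syms(vars)) %>%
    summarise(Sum = sum(Type1), .groups = "drop")
}

# Stack all grouping sets; flags that were summed over come back as NA
all_sums <- bind_rows(lapply(grouping_sets, sum_by)) %>%
  mutate(across(all_of(flags), ~ ifelse(is.na(.x), "(all)", .x))) %>%
  select(all_of(flags), Sum)
```

Because NA is used as the "summed over" marker, this sketch would mislabel genuine missing values in a flag column; such data would need a different marker.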

1
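Borrowing the sample data from @r2evans, we can use reshape2::dcast() here.
This is a somewhat unusual use of it, since we rely only on its aggregation and margins features rather than reshaping/pivoting to a wider form.
Note that data.table::dcast() supports multiple value.var but not margins, so it can't help us here.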
library(reshape2)
dcast(dat,  Flag1 + Flag2 + Flag3 ~ ., fun.aggregate = sum, value.var = "Type1", margins = TRUE)
#>     Flag1 Flag2  Flag3  .
#> 1  Level1     A  FIFTH  1
#> 2  Level1     A  FIRST  2
#> 3  Level1     A FOURTH  9
#> 4  Level1     A SECOND  1
#> 5  Level1     A  SIXTH  1
#> 6  Level1     A  THIRD  3
#> 7  Level1     A  (all) 17
#> 8  Level1     B  FIFTH  3
#> 9  Level1     B  FIRST  0
#> 10 Level1     B FOURTH  4
#> 11 Level1     B SECOND  3
#> 12 Level1     B  SIXTH  0
#> 13 Level1     B  THIRD  5
#> 14 Level1     B  (all) 15
#> 15 Level1 (all)  (all) 32
#> 16 Level2     A  FIFTH  1
#> 17 Level2     A  FIRST  2
#> 18 Level2     A FOURTH  9
#> 19 Level2     A SECOND  1
#> 20 Level2     A  SIXTH  1
#> 21 Level2     A  THIRD  3
#> 22 Level2     A  (all) 17
#> 23 Level2     B  FIFTH  3
#> 24 Level2     B  FIRST  0
#> 25 Level2     B FOURTH  4
#> 26 Level2     B SECOND  3
#> 27 Level2     B  SIXTH  0
#> 28 Level2     B  THIRD  5
#> 29 Level2     B  (all) 15
#> 30 Level2 (all)  (all) 32
#> 31 Level3     A  FIFTH  1
#> 32 Level3     A  FIRST  2
#> 33 Level3     A FOURTH  9
#> 34 Level3     A SECOND  1
#> 35 Level3     A  SIXTH  1
#> 36 Level3     A  THIRD  3
#> 37 Level3     A  (all) 17
#> 38 Level3     B  FIFTH  3
#> 39 Level3     B  FIRST  0
#> 40 Level3     B FOURTH  4
#> 41 Level3     B SECOND  3
#> 42 Level3     B  SIXTH  0
#> 43 Level3     B  THIRD  5
#> 44 Level3     B  (all) 15
#> 45 Level3 (all)  (all) 32
#> 46  (all) (all)  (all) 96

You can rename the column, or, to avoid renaming, create a constant column first:
dat$whatev <- "Sum"
dcast(dat,  Flag1 + Flag2 + Flag3 ~ whatev, fun.aggregate = sum, value.var = "Type1", margins = TRUE)
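For comparison, the same "aggregate, then add margins" idea can be sketched in base R with xtabs() and addmargins(), which one of the comments on the question hints at. This is an alternative sketch, not part of the reshape2 approach; it rebuilds only the Type1 column of the sample data, and the margin label comes out as "Sum" rather than "(all)".

```r
# Rebuild the sample data (Type1 values copied from the question)
dat <- expand.grid(
  Flag3 = c("FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH"),
  Flag2 = c("A", "B"),
  Flag1 = c("Level1", "Level2", "Level3"),
  stringsAsFactors = FALSE
)
dat$Type1 <- rep(c(2, 1, 3, 9, 1, 1, 0, 3, 5, 4, 3, 0), times = 3)

# Three-way table of Type1 sums
tab <- xtabs(Type1 ~ Flag1 + Flag2 + Flag3, data = dat)

# addmargins() adds a "Sum" level to every dimension, so each cell of the
# enlarged table is one subtotal; flatten it back to a long data frame
res <- as.data.frame.table(addmargins(tab), responseName = "Sum",
                           stringsAsFactors = FALSE)
head(res)
```

One difference from PROC SUMMARY to keep in mind: xtabs() fills flag combinations that never occur in the data with 0, so they appear as rows here instead of being omitted.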

Getting all the sums in the same table
Since the other answers did so, here is how to get all the sums in the same table:
library(reshape2)

sum_with_margins <- function(col) {
   dat$whatev <- col
   dcast(dat,  Flag1 + Flag2 + Flag3 ~ whatev, fun.aggregate = sum, 
         value.var = col, margins = paste0("Flag",1:3))
}

Reduce(merge, lapply(paste0("Type",1:3),sum_with_margins))
#>     Flag1 Flag2  Flag3 Type1 Type2 Type3
#> 1   (all) (all)  (all)    96   948     0
#> 2  Level1 (all)  (all)    32   316     0
#> 3  Level1     A  (all)    17    69     0
#> 4  Level1     A  FIFTH     1    22     0
#> 5  Level1     A  FIRST     2     0     0
#> 6  Level1     A FOURTH     9    18     0
#> 7  Level1     A SECOND     1     9     0
#> 8  Level1     A  SIXTH     1    13     0
#> 9  Level1     A  THIRD     3     7     0
#> 10 Level1     B  (all)    15   247     0
#> 11 Level1     B  FIFTH     3    40     0
#> 12 Level1     B  FIRST     0     0     0
#> 13 Level1     B FOURTH     4    96     0
#> 14 Level1     B SECOND     3     9     0
#> 15 Level1     B  SIXTH     0    17     0
#> 16 Level1     B  THIRD     5    85     0
#> 17 Level2 (all)  (all)    32   316     0
#> 18 Level2     A  (all)    17    69     0
#> 19 Level2     A  FIFTH     1    22     0
#> 20 Level2     A  FIRST     2     0     0
#> 21 Level2     A FOURTH     9    18     0
#> 22 Level2     A SECOND     1     9     0
#> 23 Level2     A  SIXTH     1    13     0
#> 24 Level2     A  THIRD     3     7     0
#> 25 Level2     B  (all)    15   247     0
#> 26 Level2     B  FIFTH     3    40     0
#> 27 Level2     B  FIRST     0     0     0
#> 28 Level2     B FOURTH     4    96     0
#> 29 Level2     B SECOND     3     9     0
#> 30 Level2     B  SIXTH     0    17     0
#> 31 Level2     B  THIRD     5    85     0
#> 32 Level3 (all)  (all)    32   316     0
#> 33 Level3     A  (all)    17    69     0
#> 34 Level3     A  FIFTH     1    22     0
#> 35 Level3     A  FIRST     2     0     0
#> 36 Level3     A FOURTH     9    18     0
#> 37 Level3     A SECOND     1     9     0
#> 38 Level3     A  SIXTH     1    13     0
#> 39 Level3     A  THIRD     3     7     0
#> 40 Level3     B  (all)    15   247     0
#> 41 Level3     B  FIFTH     3    40     0
#> 42 Level3     B  FIRST     0     0     0
#> 43 Level3     B FOURTH     4    96     0
#> 44 Level3     B SECOND     3     9     0
#> 45 Level3     B  SIXTH     0    17     0
#> 46 Level3     B  THIRD     5    85     0
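The output above contains only the nested margins (Flag1, then Flag1 with Flag2, and so on). As noted, data.table::dcast() has no margins argument, but data.table does provide cube(), which aggregates over every combination of the by columns. A minimal sketch, assuming a data.table version that includes cube() and rebuilding only the Type1 column of the sample data:

```r
library(data.table)

# Rebuild the sample data (Type1 values copied from the question)
DT <- as.data.table(expand.grid(
  Flag3 = c("FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH"),
  Flag2 = c("A", "B"),
  Flag1 = c("Level1", "Level2", "Level3"),
  stringsAsFactors = FALSE
))
DT[, Type1 := rep(c(2, 1, 3, 9, 1, 1, 0, 3, 5, 4, 3, 0), times = 3)]

flags <- c("Flag1", "Flag2", "Flag3")

# One row per combination of flag values, for every subset of the flags
res <- cube(DT, j = list(Sum = sum(Type1)), by = flags)

# cube() marks a flag that was summed over with NA; relabel it "(all)"
for (f in flags) {
  set(res, i = which(is.na(res[[f]])), j = f, value = "(all)")
}
```

The number of by columns is not hard-coded, so the same call should carry over to 7 dimensions by extending flags.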

0

Here is an attempt at what I think gives you all of the margins.

Up front

eg <- do.call(expand.grid, c(lapply(dat[1:3], function(a) c(NA, unique(a))),
                             stringsAsFactors = FALSE))
head(eg)
out <- do.call(
  Map,
  c(unname(eg), list(f = function(f1, f2, f3) {
    subx <- subset(dat, (is.na(f1) | f1 == Flag1) &
                          (is.na(f2) | f2 == Flag2) &
                          (is.na(f3) | f3 == Flag3))
    subx <- subx[,setdiff(colnames(subx), c("Flag1", "Flag2", "Flag3"))]
    c(sapply(subx, sum), "(all)" = sum(unlist(subx)))
  })))
out <- cbind.data.frame(
  sapply(eg, function(a) ifelse(is.na(a), "(all)", as.character(a))),
  do.call(rbind, out),
  stringsAsFactors = FALSE)
rownames(out) <- NULL # cosmetic
out <- out[order(out$Flag1 == "(all)", out$Flag1,
                 out$Flag2 == "(all)", out$Flag2,
                 out$Flag3 == "(all)", out$Flag3),]
out[c(1,2,7,8,13,14,20,21,64,84),] # cherry-pick for this view
#     Flag1 Flag2 Flag3 Type1 Type2 Type3 (all)
# 66 Level1     A FIFTH     1    22     0    23
# 18 Level1     A FIRST     2     0     0     2
# 6  Level1     A (all)    17    69     0    86
# 70 Level1     B FIFTH     3    40     0    43
# 46 Level1     B THIRD     5    85     0    90
# 10 Level1     B (all)    15   247     0   262
# 38 Level1 (all) THIRD     8    92     0   100
# 2  Level1 (all) (all)    32   316     0   348
# 65  (all)     A FIFTH     3    66     0    69
# 1   (all) (all) (all)    96   948     0  1044
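The anonymous function above is written for exactly three flags (f1, f2, f3). Since the real data has 7 dimensions, here is a sketch of the same idea that loops over a vector of flag column names instead. It is an illustrative variant, not the code above: flag_cols and value_cols are made-up names, the per-row "(all)" total is left out, and only Type1 and Type2 of the sample data are rebuilt.

```r
# Rebuild the sample data (values copied from the question)
dat <- expand.grid(
  Flag3 = c("FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH"),
  Flag2 = c("A", "B"),
  Flag1 = c("Level1", "Level2", "Level3"),
  stringsAsFactors = FALSE
)
dat$Type1 <- rep(c(2, 1, 3, 9, 1, 1, 0, 3, 5, 4, 3, 0), times = 3)
dat$Type2 <- rep(c(0, 9, 7, 18, 22, 13, 0, 9, 85, 96, 40, 17), times = 3)

flag_cols  <- c("Flag1", "Flag2", "Flag3")
value_cols <- setdiff(names(dat), flag_cols)

# Every combination of flag values, with NA meaning "any value"
eg <- expand.grid(lapply(dat[flag_cols], function(a) c(NA, unique(a))),
                  stringsAsFactors = FALSE)

# For each row of eg, keep the matching rows of dat and sum the value columns
sums <- lapply(seq_len(nrow(eg)), function(i) {
  keep <- rep(TRUE, nrow(dat))
  for (f in flag_cols) {
    if (!is.na(eg[i, f])) keep <- keep & dat[[f]] == eg[i, f]
  }
  colSums(dat[keep, value_cols, drop = FALSE])
})

# Relabel NA as "(all)" and bind the labels to the sums
labels <- eg
labels[is.na(labels)] <- "(all)"
out <- cbind(labels, do.call(rbind, sums))
```

Adding more dimensions only means extending flag_cols. A combination that never occurs in the data would get a sum of 0 here rather than being dropped.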

Walk-through

  1. The first part generates a frame with every possible combination of the existing Flag* values, plus a special NA (explained later):

    eg <- do.call(expand.grid, c(lapply(dat[1:3], function(a) c(NA, unique(a))),
                                 stringsAsFactors = FALSE))
    head(eg)
    #    Flag1 Flag2 Flag3
    # 1   <NA>  <NA>  <NA>
    # 2 Level1  <NA>  <NA>
    # 3 Level2  <NA>  <NA>
    # 4 Level3  <NA>  <NA>
    # 5   <NA>     A  <NA>
    # 6 Level1     A  <NA>
    nrow(eg)
    # [1] 84
    
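  2. (This looks complicated.) Map iterates over each row of eg, but it normally takes the individual list/vector components as separate arguments. Because eg is already a list (a data.frame), I need do.call to pass the columns of eg as individual (unnamed) arguments. (Technically I could write Map(function(...) ..., eg[,1], eg[,2], eg[,3]), but that is less general in case the number of columns varies or grows.)

    Inside the function, it subsets the whole frame on a simple criterion: if eg's Flag* value is NA, accept every value of that flag; otherwise match it exactly. (NB: this breaches scope a little, since the anonymous function looks up dat from outside itself.)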

    out <- do.call(
      Map,
      c(unname(eg), list(f = function(f1, f2, f3) {
        subx <- subset(dat, (is.na(f1) | f1 == Flag1) &
                              (is.na(f2) | f2 == Flag2) &
                              (is.na(f3) | f3 == Flag3))
        subx <- subx[,setdiff(colnames(subx), c("Flag1", "Flag2", "Flag3"))]
        c(sapply(subx, sum), "(all)" = sum(unlist(subx)))
      })))
    head(out, n=3)
    # $<NA>
    # Type1 Type2 Type3 (all) 
    #    96   948     0  1044 
    # $Level1
    # Type1 Type2 Type3 (all) 
    #    32   316     0   348 
    # $Level2
    # Type1 Type2 Type3 (all) 
    #    32   316     0   348 
    
  3. Rename the NAs to your aesthetic (all), making sure the columns are character (not factor). (Removing the row names that cbind... created is purely cosmetic.)

    out <- cbind.data.frame(
      sapply(eg, function(a) ifelse(is.na(a), "(all)", as.character(a))),
      do.call(rbind, out),
      stringsAsFactors = FALSE)
    rownames(out) <- NULL
    
  4. Order them by flag, putting (all) last within each flag (this is the order() call in the full code above).

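    I realize the ordering of Flag3 is not ordinal in context. For that, I suggest using a factor.

    Disclaimer: I have not verified that all of the sums are correct.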


    Data, without factors

    dat <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
       Flag1   Flag2  Flag3   Type1 Type2 Type3
    1     Level1    A    FIRST     2     0    0
    2     Level1    A    SECOND    1     9    0
    3     Level1    A    THIRD     3     7    0
    4     Level1    A    FOURTH    9    18    0
    5     Level1    A    FIFTH     1    22    0
    6     Level1    A    SIXTH     1    13    0
    7     Level1    B    FIRST     0     0    0
    8     Level1    B    SECOND    3     9    0
    9     Level1    B    THIRD     5    85    0
    10    Level1    B    FOURTH    4    96    0
    11    Level1    B    FIFTH     3    40    0
    12    Level1    B    SIXTH     0    17    0
    22    Level2    A    FIRST     2     0    0
    23    Level2    A    SECOND    1     9    0
    24    Level2    A    THIRD     3     7    0
    25    Level2    A    FOURTH    9    18    0
    26    Level2    A    FIFTH     1    22    0
    27    Level2    A    SIXTH     1    13    0
    28    Level2    B    FIRST     0     0    0
    29    Level2    B    SECOND    3     9    0
    30    Level2    B    THIRD     5    85    0
    31    Level2    B    FOURTH    4    96    0
    32    Level2    B    FIFTH     3    40    0
    33    Level2    B    SIXTH     0    17    0
    34    Level3    A    FIRST     2     0    0
    35    Level3    A    SECOND    1     9    0
    36    Level3    A    THIRD     3     7    0
    37    Level3    A    FOURTH    9    18    0
    38    Level3    A    FIFTH     1    22    0
    39    Level3    A    SIXTH     1    13    0
    40    Level3    B    FIRST     0     0    0
    41    Level3    B    SECOND    3     9    0
    42    Level3    B    THIRD     5    85    0
    43    Level3    B    FOURTH    4    96    0
    44    Level3    B    FIFTH     3    40    0
    45    Level3    B    SIXTH     0    17    0")
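The factor suggestion for Flag3 made earlier can be sketched as follows. The level order is an assumption based on the labels in the sample data, with "(all)" placed last; flag3 here is a standalone vector used only for illustration.

```r
# Desired order of the Flag3 labels, with the margin label last
flag3_levels <- c("FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH",
                  "(all)")

# Labels as they might appear in a summary table
flag3 <- c("(all)", "FIFTH", "FIRST", "FOURTH", "SECOND", "SIXTH", "THIRD")

# A factor sorts by its level order instead of alphabetically
flag3_f <- factor(flag3, levels = flag3_levels)
as.character(sort(flag3_f))
```

Applied to a summary table, that would mean converting its Flag3 column with factor(..., levels = flag3_levels) before calling order().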
    
