Using summarize(across(..., .fns = ...)) with multivariate functions

5

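My problem requires me to summarize many columns of data, but each column must be summarized using a multivariate function of three other columns.

My data frame has hundreds of columns containing different statistics about a dataset. Here is a smaller data frame with a similar structure.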

df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%

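The data needs to be condensed into rows that summarize hourly blocks of data. Summarizing the averages is easy: they can simply be averaged, because the number of measurements is consistent within each hour.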
  mutate(Hour = floor(Hour)) %>%  # bin fractional hours into whole-hour blocks, as in the full code
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),

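But combining the standard deviations is trickier, because I need information from three different columns to compute each one. Done by hand, it would look like this: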

            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))
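
Written out as an equation, the manual calculation above appears to correspond to the following. This is a sketch in notation introduced here for illustration (not from the original post): within one hourly block of n rows, \mu_i is the row's _Avg value, \sigma_i is the row's _Std value, and \bar{\mu} is the combined average for that block.

```latex
\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mu_i,
\qquad
\sigma_{\text{combined}}
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\sigma_i^{2} + \left(\mu_i - \bar{\mu}\right)^{2}\right]}
```

Each combined standard deviation therefore depends on three columns: the per-row standard deviation, the per-row average, and the combined average.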

But writing this out by hand is not feasible for hundreds of columns.

Is there a simple way to do this?

Here is the full code and the desired output:

library(dplyr)
set.seed(1)
df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%
  mutate(Hour = floor(Hour)) %>%
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))

df

   Hour combined_a1_Avg combined_a2_Avg combined_a1_Std combined_a2_Std
  <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
1     1         -0.221          -0.0306           0.859           0.859
2     2          0.0672          0.819            1.17            1.17 
3     3          0.487           0.782            0.116           0.116
4     4          0.657          -0.957            0.795           0.795
5     5         -0.305           0.620            0.583           0.583
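
As a sanity check on the combining formula used in the question, the sketch below applies it to a tiny hand-made example in base R. The toy values and the helper pop_sd are hypothetical, and the check assumes the _Std columns hold population (divide-by-n) standard deviations computed over equal-sized blocks; with sample standard deviations or unequal block sizes the formula would only be approximate.

```r
# Hypothetical toy data: four raw values split into two equal-sized blocks.
x <- c(1, 2, 3, 4)
blocks <- list(x[1:2], x[3:4])

# Population (divide-by-n) standard deviation; stats::sd() divides by n - 1 instead.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

block_avg <- sapply(blocks, mean)    # per-block averages
block_std <- sapply(blocks, pop_sd)  # per-block standard deviations

# Same formula as in the question, written with plain vectors.
combined_avg <- mean(block_avg)
combined_std <- sqrt((1 / length(block_avg)) *
                       sum(block_std^2 + (block_avg - combined_avg)^2))

# Compare with the population SD computed directly from the raw values.
print(all.equal(combined_std, pop_sd(x)))
```

Under those assumptions the combined value agrees with the standard deviation of the pooled raw values, which is why both the squared _Std term and the squared deviation of the averages are needed.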
1 Answer

7
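One approach is to loop across one set of columns (the _Avg columns) and, inside the function, fetch the matching column from the other set by replacing a substring in the current column name.
library(dplyr)
library(stringr)
# df is the raw data frame defined under "data" below
out2 <- df %>% 
   mutate(Hour = floor(Hour)) %>%
   group_by(Hour) %>%
   # first across(): per-hour mean of each a<N>_Avg column
   summarize(across(matches("a\\d+_Avg"), ~ mean(.x),
    .names = "combined_{col}"), 
         # second across(): look up the partner _Std column and the new combined mean by name
         across(matches('^a\\d+_Avg$'),
     ~ sqrt((1/n())*sum(get(str_replace(cur_column(), "Avg", "Std"))^2 +
                   (. - get(str_c( "combined_", cur_column() )))^2)), 
      .names = "combined_{str_replace(.col, 'Avg', 'Std')}"))

Checking against the OP's manual approach:
out1 <- df %>%
   mutate(Hour = floor(Hour)) %>%
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))
identical(out1, out2)
[1] TRUE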
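
To make the name lookup in the answer easier to follow, here is a minimal base-R sketch of the string mapping it relies on. It uses sub() and paste0() as stand-ins for str_replace() and str_c(), and the variable names are hypothetical; it only illustrates how one column name is turned into the others and does not run the dplyr pipeline itself.

```r
# Columns that the second across() iterates over (names from the example data).
avg_cols <- c("a1_Avg", "a2_Avg")

# Partner column holding the standard deviation: swap "Avg" for "Std".
std_cols <- sub("Avg", "Std", avg_cols, fixed = TRUE)

# Column created by the first across() call: prefix with "combined_".
combined_avg_cols <- paste0("combined_", avg_cols)

# Output name produced by the .names spec of the second across() call.
combined_std_cols <- paste0("combined_", std_cols)

print(data.frame(avg_cols, std_cols, combined_avg_cols, combined_std_cols))
```

Because every name is derived from the current column name, the same two across() calls should cover any number of a<N>_Avg / a<N>_Std pairs, provided the naming scheme stays consistent.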

data

set.seed(1)
df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2))

1
Shouldn't the first across use 'a\d+.Avg' instead of 'a._Avg'? - jpdugo17
1
@jpdugo17 You are right. That was the OP's code. It works here though, because . matches any character and there is only a single digit. - akrun
1
Thanks! This is what I was looking for. - Alex Fox
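
The regex point raised in the comments can be illustrated with base R's grepl(). The names a10_Avg and aX_Avg below are hypothetical additions that are not in the example data; they are included only to show where the two patterns would differ.

```r
# First two names come from the example data; the last two are hypothetical.
cols <- c("a1_Avg", "a2_Avg", "a10_Avg", "aX_Avg")

# "." matches exactly one character of any kind.
loose <- grepl("a._Avg", cols)

# "\\d+" matches one or more digits; the anchors exclude prefixed names.
strict <- grepl("^a\\d+_Avg$", cols)

print(data.frame(cols, loose, strict))
```

With only single-digit column numbers both patterns select the same columns, which is why the original pattern works on this data; the digit-based pattern is the safer choice once there are ten or more column groups.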
