Aggregating string variables with summarise and across


df_input is the input data frame, and df_output is the desired output.

df_input <- data.frame(id  = c(1,2,3,4,4,5,5,5,6,7,8,9,10),
                       party = c("A","B","C","D","E","F","G","H","I","J","K","L","M"), 
                       winner= c(1,1,1,1,1,1,1,1,1,1,1,1,1))
                           

df_output <- data.frame(id  = c(1,2,3,4,5,6,7,8,9,10),
                        party = c("A","B","C","D,E","F,G,H","I","J","K","L","M"),
                        winner_sum = c(1,1,1,2,3,1,1,1,1,1))  

My earlier code used summarise_at, as follows:

df_output <- df_input %>%
  dplyr::group_by_at(.vars = vars(id)) %>%
  {left_join(
    dplyr::summarise_at(., vars(party), ~ str_c(., collapse = ",")),
    dplyr::summarise_at(., vars(winner), funs(sum))
  )} 

But it no longer seems to work, because summarise_at and funs have been deprecated.

I am trying to replicate it with across in dplyr (1.0.10), but I get an error. Here is my attempt:

df_output <- df_input %>% 
  group_by(id) %>% 
  summarise(across(winner, sum, na.rm=T)) %>%
  summarise(across(party, str_c(., collapse = ",")))

I have several numeric and character variables, not just the ones in this example. Many thanks.

1 Answer


If we only need to apply a different function to each individual column, across is not needed:

library(dplyr)
library(stringr)
df_input %>% 
    group_by(id) %>% 
    summarise(party = str_c(party, collapse = ","),
        winner_sum = sum(winner))

-output

# A tibble: 10 × 3
      id party winner_sum
   <dbl> <chr>      <dbl>
 1     1 A              1
 2     2 B              1
 3     3 C              1
 4     4 D,E            2
 5     5 F,G,H          3
 6     6 I              1
 7     7 J              1
 8     8 K              1
 9     9 L              1
10    10 M              1

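If there are several "party" and "winner" columns, loop over them with across inside a single summarise call. Chaining a second summarise does not work, because after the first one only the grouping column and the summarised columns remain.
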
df_input %>% 
  group_by(id) %>% 
  summarise(across(winner, ~ sum(.x, na.rm = TRUE)),
            across(party, ~ str_c(.x, collapse = ",")), .groups = "drop")

-output

# A tibble: 10 × 3
      id winner party
   <dbl>  <dbl> <chr>
 1     1      1 A    
 2     2      1 B    
 3     3      1 C    
 4     4      2 D,E  
 5     5      3 F,G,H
 6     6      1 I    
 7     7      1 J    
 8     8      1 K    
 9     9      1 L    
10    10      1 M   
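
The point about chaining can be checked directly. The following is a minimal sketch, not part of the original answer: it rebuilds df_input from the question and inspects what is left after one summarise. The name step1 is introduced here only for illustration.

```r
library(dplyr)

# Same data as df_input in the question.
df_input <- data.frame(
  id     = c(1, 2, 3, 4, 4, 5, 5, 5, 6, 7, 8, 9, 10),
  party  = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M"),
  winner = rep(1, 13)
)

# One summarise() keeps only the grouping column and the summarised column.
step1 <- df_input %>%
  group_by(id) %>%
  summarise(across(winner, ~ sum(.x, na.rm = TRUE)))

names(step1)                # `party` is no longer among the columns
"party" %in% names(step1)   # so a second summarise() cannot aggregate it
```

Because party has already been dropped at this point, a second summarise(across(party, ...)) has nothing to select, which is why both across calls belong in the same summarise.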

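Note: if the columns share a prefix, select them all with starts_with, e.g. across(starts_with("party"), ...); if they have different names, use across(c(party, othercol), ...); or, if the function to apply depends on the column type, use across(where(is.numeric), ~ sum(.x, na.rm = TRUE)).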

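df_input %>%
    group_by(id) %>%
    # Lambdas avoid passing extra arguments through across(),
    # which dplyr 1.1.0 and later deprecate.
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
             across(where(is.character), ~ str_c(.x, collapse = ",")),
     .groups = 'drop')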
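
Since the question mentions several numeric and character variables, here is a small sketch of the type-based approach on a wider table. The data frame df_multi and its region and votes columns are made up for illustration and are not from the question. The .names argument of across is used to add a "_sum" suffix, similar to winner_sum in the desired output.

```r
library(dplyr)
library(stringr)

# Hypothetical data: `region` and `votes` are invented columns for illustration.
df_multi <- data.frame(
  id     = c(1, 2, 2, 3, 3, 3),
  party  = c("A", "B", "C", "D", "E", "F"),
  region = c("N", "S", "S", "E", "W", "W"),
  winner = c(1, 1, 1, 1, 1, 1),
  votes  = c(10, 20, NA, 5, 5, 5),
  stringsAsFactors = FALSE
)

res <- df_multi %>%
  group_by(id) %>%
  summarise(
    # Sum every numeric column (the grouping column is excluded automatically)
    # and rename the results, e.g. winner -> winner_sum.
    across(where(is.numeric), ~ sum(.x, na.rm = TRUE), .names = "{.col}_sum"),
    # Collapse every character column into one comma-separated string per group.
    across(where(is.character), ~ str_c(.x, collapse = ",")),
    .groups = "drop"
  )

res
```

Both across calls sit in one summarise, so each sees the original columns of its type. Note that str_c returns NA for a group if any of its strings is NA, so missing values in character columns may need handling before collapsing.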
