将重复的值拼接在一起并将其添加到数据框中作为新列

Question

将重复的值拼接在一起并将其添加到数据框中作为新列

3

我有一个这样的 df:

ID  Country
55  Poland
55  Romania
55  France
98  Spain
98  Portugal
98  UK
65  Germany
67  Luxembourg
84  Greece
22  Estonia
22  Lithuania

在某些情况下，ID 会重复，因为它们属于同一组。我想做的是将所有具有相同 ID 的 Country 粘贴在一起，以得到这样的输出。

到目前为止，我已经尝试使用 ifelse(df[duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE),], paste('Countries', df$Country), NA)，但这并没有得到期望的输出结果。

- Biostatician

5个回答

5

使用基础 R，

replace(v1 <- with(df, ave(as.character(Country), ID, FUN = toString)), duplicated(v1), NA)

#[1] "Poland, Romania, France" NA      NA    "Spain, Portugal, UK"     NA        NA    "Germany"      "Luxembourg"              "Greece"                  "Estonia, Lithuania"     
#[11] NA

- Sotos

4

使用dplyr，一种方法是：

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(new_name = paste0(Country,collapse = " + "), 
         new_name = replace(new_name, duplicated(new_name), NA))

#     ID Country    new_name                 
#   <int> <fct>      <chr>                    
# 1    55 Poland     Poland + Romania + France
# 2    55 Romania    NA                       
# 3    55 France     NA                       
# 4    98 Spain      Spain + Portugal + UK    
# 5    98 Portugal   NA                       
# 6    98 UK         NA                       
# 7    65 Germany    Germany                  
# 8    67 Luxembourg Luxembourg               
# 9    84 Greece     Greece                   
#10    22 Estonia    Estonia + Lithuania      
#11    22 Lithuania  NA

然而，为了得到您精确的预期输出，我们可能需要一些信息。

df %>%
   group_by(ID) %>%
   mutate(new_name = if (n() > 1) 
         paste0("Countries ", paste0(Country,collapse = " + ")) else Country,
         new_name = replace(new_name, duplicated(new_name), NA))



#     ID Country    new_name                           
#    <int> <fct>      <chr>                              
# 1    55 Poland     Countries Poland + Romania + France
# 2    55 Romania    NA                                 
# 3    55 France     NA                                 
# 4    98 Spain      Countries Spain + Portugal + UK    
# 5    98 Portugal   NA                                 
# 6    98 UK         NA                                 
# 7    65 Germany    Germany                            
# 8    67 Luxembourg Luxembourg                         
# 9    84 Greece     Greece                             
#10    22 Estonia    Countries Estonia + Lithuania      
#11    22 Lithuania  NA

- Ronak Shah

要得到与原始问题完全相同的结果，请添加 ... mutate（new_name = paste（"Countries"，paste0（Country，collapse = "+"）），...。 - boski

@RonakShah 谢谢！但是我如何在国家组的开头仅添加一次“国家”，而不是每次列出新国家时都添加？即“国家波兰+罗马尼亚+法国”。 - Biostatician

@Biostatician 哎呀..抱歉。没有注意到Countries部分重复了。已更新答案。 - Ronak Shah

3

首次使用aggregate，然后进行match：

flat <- function(x) paste("Countries:", paste(x,collapse=", "))
tmp <- aggregate(Country ~ ID, data=dat, FUN=flat)
dat$Country <- NA
dat$Country[match(tmp$ID, dat$ID)] <- tmp$Country

#   ID                            Country
#1  55 Countries: Poland, Romania, France
#2  55                               <NA>
#3  55                               <NA>
#4  98     Countries: Spain, Portugal, UK
#5  98                               <NA>
#6  98                               <NA>
#7  65                 Countries: Germany
#8  67              Countries: Luxembourg
#9  84                  Countries: Greece
#10 22      Countries: Estonia, Lithuania
#11 22                               <NA>

- thelatemail

1

使用 purrr 和 dplyr:

    df %>%
    nest(-ID) %>% 
    mutate(new_name = map_chr(data, ~ paste0(.x$Country, collapse = " + "))) %>% 
    unnest()

表格：

  ID new_name                  Country     
  55 Poland + Romania + France Poland    
  55 Poland + Romania + France Romania   
  55 Poland + Romania + France France    
  98 Spain + Portugal + UK     Spain     
  98 Spain + Portugal + UK     Portugal  
  98 Spain + Portugal + UK     UK        
  65 Germany                   Germany   
  67 Luxembourg                Luxembourg
  84 Greece                    Greece    
  22 Estonia + Lithuania       Estonia   
  22 Estonia + Lithuania       Lithuania

- tvdo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Veerendra Gadekar · Accepted Answer

使用 data.table

library(data.table)

setDT(df)[, New_Name := c(paste0(Country, collapse = " + ")[1L],  rep(NA, .N -1)), by = ID]

#df
#ID    Country                  New_Name
#1: 55     Poland Poland + Romania + France
#2: 55    Romania                      <NA>
#3: 55     France                      <NA>
#4: 98      Spain     Spain + Portugal + UK
#5: 98   Portugal                      <NA>
#6: 98         UK                      <NA>
#7: 65    Germany                   Germany
#8: 67 Luxembourg                Luxembourg
#9: 84     Greece                    Greece
#10: 22    Estonia       Estonia + Lithuania
#11: 22  Lithuania                      <NA>