将重复的值拼接在一起并将其添加到数据框中作为新列

3

我有一个这样的 df:

ID  Country
55  Poland
55  Romania
55  France
98  Spain
98  Portugal
98  UK
65  Germany
67  Luxembourg
84  Greece
22  Estonia
22  Lithuania

在某些情况下,ID 会重复,因为它们属于同一组。我想做的是将所有具有相同 IDCountry 粘贴在一起,以得到这样的输出。

enter image description here

到目前为止,我已经尝试使用 ifelse(df[duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE),], paste('Countries', df$Country), NA),但这并没有得到期望的输出结果。

5个回答

7

使用 data.table

library(data.table)

setDT(df)[, New_Name := c(paste0(Country, collapse = " + ")[1L],  rep(NA, .N -1)), by = ID]

#df
#ID    Country                  New_Name
#1: 55     Poland Poland + Romania + France
#2: 55    Romania                      <NA>
#3: 55     France                      <NA>
#4: 98      Spain     Spain + Portugal + UK
#5: 98   Portugal                      <NA>
#6: 98         UK                      <NA>
#7: 65    Germany                   Germany
#8: 67 Luxembourg                Luxembourg
#9: 84     Greece                    Greece
#10: 22    Estonia       Estonia + Lithuania
#11: 22  Lithuania                      <NA>

另一种可能性是:setDT(df)[rowid(ID)==1L, nn := df[, paste(Country, collapse=" + "), ID]$V1] - chinsoon12

5

使用基础 R,

replace(v1 <- with(df, ave(as.character(Country), ID, FUN = toString)), duplicated(v1), NA)

#[1] "Poland, Romania, France" NA      NA    "Spain, Portugal, UK"     NA        NA    "Germany"      "Luxembourg"              "Greece"                  "Estonia, Lithuania"     
#[11] NA 

4
使用dplyr,一种方法是:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(new_name = paste0(Country,collapse = " + "), 
         new_name = replace(new_name, duplicated(new_name), NA))

#     ID Country    new_name                 
#   <int> <fct>      <chr>                    
# 1    55 Poland     Poland + Romania + France
# 2    55 Romania    NA                       
# 3    55 France     NA                       
# 4    98 Spain      Spain + Portugal + UK    
# 5    98 Portugal   NA                       
# 6    98 UK         NA                       
# 7    65 Germany    Germany                  
# 8    67 Luxembourg Luxembourg               
# 9    84 Greece     Greece                   
#10    22 Estonia    Estonia + Lithuania      
#11    22 Lithuania  NA                  

然而,为了得到您精确的预期输出,我们可能需要一些信息。
df %>%
   group_by(ID) %>%
   mutate(new_name = if (n() > 1) 
         paste0("Countries ", paste0(Country,collapse = " + ")) else Country,
         new_name = replace(new_name, duplicated(new_name), NA))



#     ID Country    new_name                           
#    <int> <fct>      <chr>                              
# 1    55 Poland     Countries Poland + Romania + France
# 2    55 Romania    NA                                 
# 3    55 France     NA                                 
# 4    98 Spain      Countries Spain + Portugal + UK    
# 5    98 Portugal   NA                                 
# 6    98 UK         NA                                 
# 7    65 Germany    Germany                            
# 8    67 Luxembourg Luxembourg                         
# 9    84 Greece     Greece                             
#10    22 Estonia    Countries Estonia + Lithuania      
#11    22 Lithuania  NA                              

要得到与原始问题完全相同的结果,请添加 ... mutate(new_name = paste("Countries",paste0(Country,collapse = "+")),... - boski
@RonakShah 谢谢!但是我如何在国家组的开头仅添加一次“国家”,而不是每次列出新国家时都添加?即“国家波兰+罗马尼亚+法国”。 - Biostatician
@Biostatician 哎呀..抱歉。没有注意到Countries部分重复了。已更新答案。 - Ronak Shah

3

首次使用aggregate,然后进行match

flat <- function(x) paste("Countries:", paste(x,collapse=", "))
tmp <- aggregate(Country ~ ID, data=dat, FUN=flat)
dat$Country <- NA
dat$Country[match(tmp$ID, dat$ID)] <- tmp$Country

#   ID                            Country
#1  55 Countries: Poland, Romania, France
#2  55                               <NA>
#3  55                               <NA>
#4  98     Countries: Spain, Portugal, UK
#5  98                               <NA>
#6  98                               <NA>
#7  65                 Countries: Germany
#8  67              Countries: Luxembourg
#9  84                  Countries: Greece
#10 22      Countries: Estonia, Lithuania
#11 22                               <NA>

1
使用 purrrdplyr:

    df %>%
    nest(-ID) %>% 
    mutate(new_name = map_chr(data, ~ paste0(.x$Country, collapse = " + "))) %>% 
    unnest()

表格:

  ID new_name                  Country     
  55 Poland + Romania + France Poland    
  55 Poland + Romania + France Romania   
  55 Poland + Romania + France France    
  98 Spain + Portugal + UK     Spain     
  98 Spain + Portugal + UK     Portugal  
  98 Spain + Portugal + UK     UK        
  65 Germany                   Germany   
  67 Luxembourg                Luxembourg
  84 Greece                    Greece    
  22 Estonia + Lithuania       Estonia   
  22 Estonia + Lithuania       Lithuania 

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接