Filtering a data frame

Question

9

I have the following data frame df:

df = data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                col2    = c('a','a','a','b','b','b','b','a','a'),
                height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                col3    = 1:9)

#  col1 col2 height1 height2 col3
#1    a    a      NA    31.0    1
#2    a    a      32    31.5    2
#3    a    a      NA      NA    3
#4    a    b      NA      NA    4
#5    a    b      NA    11.0    5
#6    b    b      NA    12.0    6
#7    b    b      NA    13.0    7
#8    c    a      25      NA    8
#9    d    a      NA      NA    9

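For each group of values in col1, col2, I want to build a column named height containing the following values:

If both height1 and height2 contain only NA, return NA.
If there is a value in height1, take that value. (For a given group of col1, col2, the column height1 contains at most one non-NA value.)
If height1 contains only NA but height2 contains some non-NA values, take the first value in height2.

I also need to keep the corresponding value of col3.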
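The three rules can be read as "which row of the group do I keep?", which also settles the col3 requirement. A minimal sketch of that reading, where `pick_row` is a hypothetical helper (not from the thread) and keeping the first row of an all-NA group is an assumption taken from the expected output:

```r
# Hypothetical helper: index of the row to keep within one (col1, col2) group.
pick_row <- function(height1, height2) {
  # a value exists in height1: keep that row
  if (any(!is.na(height1))) return(which(!is.na(height1))[1])
  # only height2 has values: keep its first non-NA row
  if (any(!is.na(height2))) return(which(!is.na(height2))[1])
  # everything is NA: keep the first row (assumption), height stays NA
  1L
}

pick_row(c(NA, 32, NA), c(31, 31.5, NA))  # group (a, a)
pick_row(c(NA, NA), c(NA, 11))            # group (a, b)
pick_row(NA, NA)                          # group (d, a)
```

The height itself is then height1 at that row if it is not NA, otherwise height2 at that row.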

The new data.frame new.df would look like this:

#  col1 col2 height col3
#1    a    a     32    2
#2    a    b     11    5
#3    b    b     12    6
#4    c    a     25    8
#5    d    a     NA    9

I would prefer a data.frame method, ideally a very concise one, but I realize I cannot find one!

- Colonel Beauvel

3 Answers

2

Using dplyr:

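library(dplyr)

df %>%
  mutate(
    # rank rows: 1 = height1 present, 2 = only height2 present, 3 = neither
    order = ifelse(!is.na(height1), 1, ifelse(!is.na(height2), 2, 3)),
    # height1 if present, otherwise height2 (NA when both are missing)
    height = ifelse(!is.na(height1), height1, ifelse(!is.na(height2), height2, NA))
  ) %>%
  arrange(col1, col2, order, height) %>%
  # keep the first row per group; .keep_all = TRUE retains height and col3
  distinct(col1, col2, .keep_all = TRUE) %>%
  select(col1, col2, height, col3)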

- bergant

Not a fan of dplyr. It looks a bit slow, but the answer is very neat! - Colonel Beauvel

Don't measure neatness and speed by the same standard! :) - bergant

1

I used data.table (although, for once, I would have liked a data.frame option), but I do not find my solution elegant:

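library(data.table)

# pick the single row to keep for one (col1, col2) group
func = function(df)
{
    # both height columns are all NA: keep the first row
    if(all(is.na(subset(df, select=c(height1,height2)))))
        return(df[1,])

    # height1 has a value: keep that row
    if(any(!is.na(df$height1)))
        return(df[!is.na(df$height1),])

    # otherwise keep the first row with a non-NA height2
    df[!is.na(df$height2),][1,]
}

setDT(df)
new.df=df[,func(.SD),by=list(col1,col2)]
new.df = data.frame(new.df)

# height1 when available, otherwise height2
new.df$height = ifelse(is.na(new.df$height1), new.df$height2, new.df$height1)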

#> new.df
#  col1 col2 height1 height2 col3 height
#1    a    a      32    31.5    2     32
#2    a    b      NA    11.0    5     11
#3    b    b      NA    12.0    6     12
#4    c    a      25      NA    8     25
#5    d    a      NA      NA    9     NA

- Colonel Beauvel


- Cath · Accepted Answer

Maybe this is not the elegant solution you are looking for, but here is an option based on base R:

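do.call("rbind",
        lapply(split(df, paste0(df$col1, df$col2)),
               function(tab) {
                 # rename height1 and height2 so either one can serve as "height"
                 colnames(tab)[3:4] <- "height"
                 if (any(!is.na(tab[, 3]))) {
                   # height1 has a value: keep that row and drop the height2 column
                   tab[which(!is.na(tab[, 3])), -4]
                 } else if (any(!is.na(tab[, 4]))) {
                   # only height2 has values: keep its first non-NA row, drop height1
                   tab[which(!is.na(tab[, 4]))[1], c(1:2, 4:5)]
                 } else {
                   # both columns are all NA: keep the first row
                   tab[1, -4]
                 }
               }
        )
      )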

#       col1 col2 height col3
#    aa    a    a     32    2
#    ab    a    b     11    5
#    bb    b    b     12    6
#    ca    c    a     25    8
#    da    d    a     NA    9
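
For comparison, the rank, sort, and keep-first idea from the dplyr answer can also be written with plain data.frame operations. This is only a sketch, not something posted in the thread; the names `prio` and `sorted` are introduced here for illustration, and it relies on `order()` leaving ties in their original row order so that "first value in height2" is respected.

```r
df <- data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                 col2    = c('a','a','a','b','b','b','b','a','a'),
                 height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                 height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                 col3    = 1:9)

# priority of each row: 1 = height1 present, 2 = only height2 present, 3 = neither
prio <- ifelse(!is.na(df$height1), 1L, ifelse(!is.na(df$height2), 2L, 3L))

# height1 if available, otherwise height2 (which may itself be NA)
df$height <- ifelse(is.na(df$height1), df$height2, df$height1)

# sort by group, then by priority; ties keep their original row order
sorted <- df[order(df$col1, df$col2, prio), ]

# keep the first row of each (col1, col2) group and the requested columns
new.df <- sorted[!duplicated(sorted[c("col1", "col2")]),
                 c("col1", "col2", "height", "col3")]
rownames(new.df) <- NULL
new.df
```

On the sample data this keeps the rows with col3 equal to 2, 5, 6, 8 and 9, matching the new.df requested in the question.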