Subset a data frame by the maximum value within each group

Question


17

I have a 180,000 x 400 data frame in which each row corresponds to a user, but every user has exactly two rows.

id   date  ...
1    2012    ...
3    2010    ...
2    2013    ...
2    2014    ...
1    2011    ...
3    2014    ...

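I want to subset the data so that only the most recent row for each user is kept (that is, the row with the highest date value for each id).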

I first tried looping over the ids with which(), using an ifelse() statement inside sapply(), but it was extremely slow (O(n^2), I believe).

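I then tried sorting the df by id and looping in steps of two, comparing adjacent dates, but that was also slow (I suppose because loops in R are slow). Comparing the two dates is the bottleneck; the sort itself was almost instantaneous.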

Is there a way to vectorize the comparison?

Solution, taken from "Remove duplicates keeping entry with largest absolute value":

aa <- df[order(df$id, -df$date), ] #sort by id and reverse of date
aa[!duplicated(aa$id),]

Runs really fast!!

- mattdevlin
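As a self-contained illustration of the sort-then-deduplicate idea above, here is a minimal sketch on the six sample rows from the question. The `value` column is a hypothetical stand-in for the other ~400 columns, and `xtfrm()` is used instead of a bare minus sign on the assumption that `date` might be a Date rather than a plain number.

```r
# Sample data mirroring the question; "value" is a hypothetical extra column
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c("a", "b", "c", "d", "e", "f")
)

# Sort by id ascending and date descending.
# xtfrm() turns the column into a numeric sort key, so negating it is safe.
sorted <- df[order(df$id, -xtfrm(df$date)), ]

# duplicated() marks every row after the first one for each id,
# so negating it keeps only the most recent row per id
latest <- sorted[!duplicated(sorted$id), ]
latest
```

Both order() and duplicated() operate on whole vectors, so no explicit loop over users is needed.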

2 Answers

6

aggregate should work as well:

aggregate(date ~ id, df, max)

- talat
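Note that the aggregate() call returns only id and the maximum date, not the remaining columns. One way to recover the full rows, sketched here under the assumption that the (id, date) pair identifies a row, is to merge the result back onto the original data. The `value` column is hypothetical.

```r
# Sample data mirroring the question; "value" is a hypothetical extra column
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c("a", "b", "c", "d", "e", "f")
)

# One row per id holding the latest date
latest_dates <- aggregate(date ~ id, data = df, FUN = max)

# Join back on both keys to pick up the other columns.
# If two rows of one id share the maximum date, both are returned.
latest <- merge(df, latest_dates, by = c("id", "date"))
latest
```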

This will be very, very slow on a big data set... - David Arenburg

3

Actually, it takes 1.5 seconds on my crappy laptop, @DavidArenburg. Your base solution takes under 1 second, so kudos to you. - rawr

@rawr For what data size? - David Arenburg

1

dat data.frame 578922376 552.1 Mb 180000 402 - rawr


- David Arenburg · Accepted Answer

Here's a simple and fast approach using the data.table package:

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
#    id date
# 1:  1 2012
# 2:  3 2014
# 3:  2 2014

Or (possibly a bit faster, because of the keyed by):
setkey(setDT(df), id)[, .SD[which.max(date)], id]


Or use the OP's idea, implemented with the data.table package:

unique(setorder(setDT(df), id, -date), by = "id")


Or

setorder(setDT(df), id, -date)[!duplicated(id)]
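A further data.table variant, not part of the original answer, collects the row numbers of the per-group maxima with `.I` and then subsets once, instead of building `.SD` for every group. This is only a sketch on the question's sample rows; the `value` column is hypothetical, and as.data.table() is used so that the original df is not modified by reference.

```r
library(data.table)

# Sample data mirroring the question; "value" is a hypothetical extra column
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c("a", "b", "c", "d", "e", "f")
)
dt <- as.data.table(df)  # copy, so df itself stays a plain data.frame

# .I holds the row numbers of the current group; which.max() picks the latest one.
# The unnamed result column is called V1 by default.
rows <- dt[, .I[which.max(date)], by = id]$V1

# A single subset of the full table by those row numbers
latest <- dt[rows]
latest
```

Groups come back in order of first appearance, so the ids are ordered 1, 3, 2, as in the output shown earlier.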

Or a base R solution:
with(df, tapply(date, id, function(x) x[which.max(x)]))
##    1    2    3 
## 2012 2014 2014 
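The tapply() call above returns a named vector of the latest dates rather than the rows themselves. If whole rows are wanted in base R, one option is ave(), which broadcasts each group's maximum back to every row so that the comparison is a single vectorized step. This is a sketch assuming a numeric `date` column; the `value` column is hypothetical.

```r
# Sample data mirroring the question; "value" is a hypothetical extra column
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c("a", "b", "c", "d", "e", "f")
)

# ave() returns a vector as long as df$date, filled with each id's maximum date
group_max <- ave(df$date, df$id, FUN = max)

# One vectorized comparison; ties on the maximum date would keep several rows per id
latest <- df[df$date == group_max, ]
latest
```

Rows are returned in their original order, and all columns are kept.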


Another approach:

library(dplyr)
df %>%
  group_by(id) %>%
  filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
# 
#   id date
# 1  1 2012
# 2  2 2014
# 3  3 2014


Or

df %>%
  group_by(id) %>%
  slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties


Or

df %>%
  group_by(id) %>%
  summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties
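Newer dplyr releases (1.0.0 and later, which postdate this answer) also offer slice_max(), where tie handling is an explicit argument. The following is a sketch on the question's sample rows, with a hypothetical `value` column.

```r
library(dplyr)

# Sample data mirroring the question; "value" is a hypothetical extra column
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c("a", "b", "c", "d", "e", "f")
)

latest <- df %>%
  group_by(id) %>%
  # Keep the single row with the highest date in each group;
  # with_ties = FALSE guarantees one row per id even when dates are tied
  slice_max(date, n = 1, with_ties = FALSE) %>%
  ungroup()
latest
```

Setting with_ties = TRUE instead would behave like the filter(date == max(date)) variant and keep every tied row.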