Merge data tables, keeping the most recent rows or adding new rows

Question

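Suppose I have two data tables that I want to merge by two variables, updating only the entries where another column (a time) is greater than the original value. It should also be a full join, so if the new data contains entries that are not in the original, they should be appended. What is a good solution to this problem?

For example: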

## Initial data
dt1 <- data.table(user=c('a', 'a', 'b'), 
                   cell=c(1, 2, 1),
                   expires=as.POSIXct(rep('Jan 25 21:24', 3), format='%b %d %H:%M'))

## New data to update initial
dt2 <- data.table(user=c('a', 'c'), 
                 cell=c(1, 1),
                 expires=as.POSIXct(rep('Jan 25 21:59', 2), format='%b %d %H:%M'))

## Attempt
merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
  , expires := pmax(expires.x, expires.y, na.rm=TRUE)][]

## Desired result: user a in cell 1 has been updated, user c has been added
(res <- rbindlist(list(dt2, dt1[2:3,]))[order(user, cell)])
#    user cell             expires
# 1:    a    1 2016-01-25 21:59:00
# 2:    a    2 2016-01-25 21:24:00
# 3:    b    1 2016-01-25 21:24:00
# 4:    c    1 2016-01-25 21:59:00

- Rorschach

Is the only difference between your attempt and the desired solution the extra columns? - Pierre L

One option is

res1 <- dt1[dt2, expires := pmax(expires, i.expires), on = c('user', 'cell'), by = .EACHI]
res2 <- dt2[dt1, expires := pmax(expires, i.expires), on = c('user', 'cell'), by = .EACHI]
unique(rbindlist(list(res1, res2)))

- akrun

2 Answers

As I see it, you are already close to a solution; you only need to extend your chain so that it drops the two helper columns:

require(data.table)
dt1 <- data.table(user=c('a', 'a', 'b'), 
                  cell=c(1, 2, 1),
                  expires=as.POSIXct(rep(Sys.time(), 3)) )
# user cell             expires
# 1:    a    1 2016-01-26 11:19:49
# 2:    a    2 2016-01-26 11:19:49
# 3:    b    1 2016-01-26 11:19:49


## New data to update initial
dt2 <- data.table(user=c('a', 'c'), 
                  cell=c(1, 1),
                  expires=as.POSIXct(rep(Sys.time(), 2)) )
# user cell             expires
# 1:    a    1 2016-01-26 11:20:46
# 2:    c    1 2016-01-26 11:20:46

## Attempt
res_merge = merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
  , expires := pmax(expires.x, expires.y, na.rm=TRUE)][, `:=`(expires.x=NULL,expires.y=NULL)][]

# user cell             expires
# 1:    a    1 2016-01-26 11:20:46
# 2:    a    2 2016-01-26 11:19:49
# 3:    b    1 2016-01-26 11:19:49
# 4:    c    1 2016-01-26 11:20:46

- Bruno Sarrant
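
The answer above relies on two details: merge(..., all=TRUE) leaves NA in expires.x or expires.y for rows that exist in only one table, and pmax(..., na.rm=TRUE) then falls back to the non-missing value. The following is a minimal sketch of that behavior, assuming fixed UTC timestamps (the names t_old and t_new are illustrative and not from the original post) so that the result is reproducible:

```r
library(data.table)

# Illustrative fixed timestamps (assumed values, chosen to mirror the question)
t_old <- as.POSIXct("2016-01-25 21:24:00", tz = "UTC")
t_new <- as.POSIXct("2016-01-25 21:59:00", tz = "UTC")

dt1 <- data.table(user = c("a", "a", "b"), cell = c(1, 2, 1), expires = t_old)
dt2 <- data.table(user = c("a", "c"), cell = c(1, 1), expires = t_new)

# Full outer join: unmatched rows get NA in the other table's expires column
merged <- merge(dt1, dt2, by = c("user", "cell"), all = TRUE)

# Take the later of the two timestamps, ignoring the NA side
merged[, expires := pmax(expires.x, expires.y, na.rm = TRUE)]

# Drop the helper columns created by the merge
merged[, c("expires.x", "expires.y") := NULL]

print(merged)
```

Without na.rm = TRUE, the rows for users b and c (and for a in cell 2) would end up with NA in expires, because each of them is missing from one of the two inputs.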

- David Arenburg · Accepted Answer

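It looks like you need an outer join, which is usually not very memory efficient, so simply running rbind should be computationally cheaper. Follow that with a plain order (which seems to use data.table's forder) and wrap the result in data.table's unique method. Because unique keeps the first row for each user/cell pair, sorting by expires in descending order first means the most recent row survives. This looks promising: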

unique(rbind(dt1, dt2)[order(-expires)], by = c("user", "cell"))
#    user cell             expires
# 1:    a    1 2016-01-25 21:59:00
# 2:    c    1 2016-01-25 21:59:00
# 3:    a    2 2016-01-25 21:24:00
# 4:    b    1 2016-01-25 21:24:00
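
The same idea can be written step by step. This is a sketch rather than the answer's exact code: it assumes fixed UTC timestamps (t_old and t_new are illustrative names) and uses setorder, which sorts by reference, in place of the order call inside the brackets. A final setorder restores the user/cell ordering shown in the question's desired result.

```r
library(data.table)

# Illustrative fixed timestamps (assumed values, chosen to mirror the question)
t_old <- as.POSIXct("2016-01-25 21:24:00", tz = "UTC")
t_new <- as.POSIXct("2016-01-25 21:59:00", tz = "UTC")

dt1 <- data.table(user = c("a", "a", "b"), cell = c(1, 2, 1), expires = t_old)
dt2 <- data.table(user = c("a", "c"), cell = c(1, 1), expires = t_new)

# Stack both tables; a new object is created, so dt1 and dt2 are untouched
stacked <- rbind(dt1, dt2)

# Newest first, so the first row per (user, cell) is the most recent one
setorder(stacked, -expires)

# unique() keeps the first occurrence of each (user, cell) combination
latest <- unique(stacked, by = c("user", "cell"))

# Sort by the key columns to match the layout of the desired result
setorder(latest, user, cell)

print(latest)
```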