Lookup in R using data.table

Question


4

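I have a dataset in long format that contains repeated observations on individuals, so each row is an observation of type A or type B. The following code reproduces the dataset.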

library(data.table)
set.seed(1487)
dat <- data.table(id = rep(seq(10), 2), 
                  type = c(rep("A", 10), rep("B", 10)), 
                  x = sample.int(100,20))
dat
#     id type  x
#  1:  1    A 38
#  2:  2    A 58
#  3:  3    A 28
#  4:  4    A 21
#  5:  5    A 19
#  6:  6    A 62
#  7:  7    A 52
#  8:  8    A 86
#  9:  9    A 85
# 10: 10    A 90
# 11:  1    B 15
# 12:  2    B 11
# 13:  3    B 37
# 14:  4    B 93
# 15:  5    B 34
# 16:  6    B 91
# 17:  7    B 79
# 18:  8    B 94
# 19:  9    B 24
# 20: 10    B 41

I then select the 3 individuals with the highest x within each observation type:

setorderv(dat, c("type", "x"), c(1, -1))
top3 <- dat[, head(.SD, 3), by = list(type)]
top3
#    type id  x
# 1:    A 10 90
# 2:    A  8 86
# 3:    A  9 85
# 4:    B  8 94
# 5:    B  4 93
# 6:    B  6 91

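Now I want to add a column that holds the original x value of the opposite observation type for the same id, if that makes sense. The following code reproduces what I am looking for: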

top3[,x2 := c(41, 94, 24, 86, 21, 62)]
#    type id  x x2
# 1:    A 10 90 41
# 2:    A  8 86 94
# 3:    A  9 85 24
# 4:    B  8 94 86
# 5:    B  4 93 21
# 6:    B  6 91 62

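Of course, I could loop over the whole dataset row by row with if statements or something similar. The original dataset is very large, though, and I am looking for an elegant and efficient way to do this. I really like data.table and have been using it a lot lately. I know there must be a simple and elegant way to do it. I have also tried using .GRP. I could use some help.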

Thanks in advance!

My final solution

Thanks to everyone who provided inspiration. For my problem, this is the solution that ended up working better in practice.

dat <- dcast.data.table(dat, id~type, value.var = "x")
top3 <- rbind(dat[order(-A), head(.SD, 3L)][,rank_by := "A"],
              dat[order(-B), head(.SD, 3L)][,rank_by := "B"])
#    id  A  B rank_by
# 1: 10 90 41       A
# 2:  8 86 94       A
# 3:  9 85 24       A
# 4:  8 86 94       B
# 5:  4 21 93       B
# 6:  6 62 91       B

Cheers,

tstev

- tstev

4 Answers

4

Probably not the most elegant approach, but it works:

setkeyv(dat, c("type", "id"))

my.order <- dat[order(-rank(type)), .(id, type)]
dat[, x2 := dat[.(my.order$type, my.order$id), x]]

setorderv(dat, c("type", "x"), c(1, -1))
top3 <- dat[, head(.SD, 3), by = .(type)]
top3

#    type id  x x2
# 1:    A 10 90 41
# 2:    A  8 86 94
# 3:    A  9 85 24
# 4:    B  8 94 86
# 5:    B  4 93 21
# 6:    B  6 91 62

Edit: After reading @eddi's answer and the discussion about readability, I was reminded of the dplyr package. So, following his steps:

library(dplyr)
dat %>%
  arrange(desc(x)) %>%
  group_by(type) %>%
  summarise_each(funs(head(., 3))) %>%
  left_join(., dat, by = "id") %>%
  filter(type.x != type.y) %>%
  arrange(type.x, desc(id))
#   id type.x x.x type.y x.y
# 1 10      A  90      B  41
# 2  9      A  85      B  24
# 3  8      A  86      B  94
# 4  8      B  94      A  86
# 5  6      B  91      A  62
# 6  4      B  93      A  21

- Andriy T.

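This code has two drawbacks: it adds the new column to the whole dataset rather than to the (much smaller) result, so you carry more data than you need; and it requires a nested [ on the data.table when creating the new column. - Andriy T.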
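One possible way around both drawbacks mentioned in the comment above is an update join that writes the looked-up value only into the small result table. The following is a minimal sketch on toy data, not the method of any answer here; it assumes a reasonably recent data.table that supports `on = .()`, and the names `top` and `other` are hypothetical.

```r
library(data.table)

# Toy data in the same long layout as the question (values are made up)
dat <- data.table(id   = rep(1:3, 2),
                  type = rep(c("A", "B"), each = 3),
                  x    = c(10, 30, 20, 5, 15, 25))

# Top 2 rows per type by x
top <- dat[order(-x), head(.SD, 2), by = type]

# Helper column holding the opposite type (hypothetical name)
top[, other := ifelse(type == "A", "B", "A")]

# Update join: match top's (id, other) against dat's (id, type)
# and copy dat's x into a new column x2 of top only
top[dat, on = .(id, other = type), x2 := i.x]
top[, other := NULL]
top
```

Because the assignment happens by reference on `top`, `dat` itself is left unchanged and no extra column is created for rows that never make it into the result.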

1

How about

subset(merge(top3, dat, by = "id"), type.x != type.y)[, type.y:=NULL][]   
#   id type.x x.x x.y
#1:  4      B  93  21
#2:  6      B  91  62
#3:  8      A  86  94
#4:  8      B  94  86
#5:  9      A  85  24
#6: 10      A  90  41

(To keep the same column names as in your post, you would need to wrap this in setnames(..., c("id", "type", "x", "x2")).)
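The renaming step described above could look roughly like the following. This is a sketch on toy data rather than the question's dataset; it swaps `subset()` for an equivalent data.table filter, and the names `top` and `res` are hypothetical.

```r
library(data.table)

# Toy data in the same long layout as the question (values are made up)
dat <- data.table(id   = rep(1:3, 2),
                  type = rep(c("A", "B"), each = 3),
                  x    = c(10, 30, 20, 5, 15, 25))
top <- dat[order(-x), head(.SD, 2), by = type]

# Merge on id, keep only the rows pairing opposite types, drop the extra type column
res <- merge(top, dat, by = "id")[type.x != type.y][, type.y := NULL]

# Rename by position: id, type.x, x.x, x.y -> id, type, x, x2
setnames(res, c("id", "type", "x", "x2"))

# Restore the "top n per type" ordering
setorder(res, type, -x)
res
```

Renaming by position depends on the column order that `merge()` returns, so it is worth checking `names(res)` before calling `setnames()`.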

- konvas

Great, that is a very nice solution. Thank you very much! - tstev

0

Probably not the most elegant way, but I would suggest the following code:

## Merge separately for each type (drop type)
top3A <- merge(top3[top3$type =="A",2:3],dat[dat$type=="B",c(1,3)],by = c("id"))
top3B <- merge(top3[top3$type =="B",2:3],dat[dat$type=="A",c(1,3)],by = c("id"))
## add type which we dropped before
top3A$type <- "A"
top3B$type <- "B"
## combine both result sets
top3 <- rbind(top3A,top3B)
## rename columns and reorder/resort results
colnames(top3)[2:3] <- c("x","x2")
top3 <- top3[order(type,-id),c(4,1,2,3)]

Best of luck

- bladeaka667


- eddi · Accepted Answer

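It looks like you want to merge back on the opposite type and id. Depending on your exact situation, I would probably skip flipping the type and instead merge on id against both types, then discard the rows where the types are the same (the code below assumes version 1.9.5+):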

(dat[order(-x), head(.SD, 3), by = type]
    [dat, on = 'id', nomatch = 0][type != i.type]
    [order(type, -id)])
#   type id  x i.type i.x
#1:    A 10 90      B  41
#2:    A  8 86      B  94
#3:    A  9 85      B  24
#4:    B  8 94      A  86
#5:    B  4 93      A  21
#6:    B  6 91      A  62
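If the exact layout from the question is wanted (columns type, id, x, x2), the joined result above could be tidied along these lines. This is a sketch on toy data under the same join logic; the name `res` is hypothetical, and the `i.` prefixed columns are the ones data.table creates for clashing column names from the table in `i`.

```r
library(data.table)

# Toy data in the same long layout as the question (values are made up)
dat <- data.table(id   = rep(1:3, 2),
                  type = rep(c("A", "B"), each = 3),
                  x    = c(10, 30, 20, 5, 15, 25))

# Top 2 per type, joined back to dat on id, keeping only opposite-type pairs
res <- dat[order(-x), head(.SD, 2), by = type][
  dat, on = "id", nomatch = 0][type != i.type]

# i.x holds the opposite type's x; i.type is no longer needed
setnames(res, "i.x", "x2")
res[, "i.type" := NULL]
setorder(res, type, -x)
res
```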