Select first and last row from grouped data

r dplyr

201

Question

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data & Example

Given a data frame:

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
                 stopId=c("a","b","c","a","b","c","a","b","c"), 
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

I can get the top and bottom observations from each group using slice, but this takes two separate statements:

firstStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1) %>%
  ungroup

lastStop <- df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(n()) %>%
  ungroup

Can I combine these two statements into one that selects both the top and bottom observations?

- tospig

See also: How to select the first and last row within a grouping variable in a data frame? - Henrik

10 Answers

139

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

which gives

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
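
The same index-vector idea extends to the first and last few rows of each group. The following is only a sketch, assuming dplyr is installed; the helper variable k and the result name top_bottom are made up for illustration, and unique() is there so that small groups do not return the same row twice.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

k <- 2  # hypothetical: number of rows to keep from each end of a group

top_bottom <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  # positions of the first k and last k rows; unique() drops any overlap
  slice(unique(c(head(seq_len(n()), k), tail(seq_len(n()), k)))) %>%
  ungroup()

# Each group here has only 3 rows, so with k = 2 every row is kept once
nrow(top_bottom)
```

With k <- 1 this reduces to slice(c(1, n())) for groups that have more than one row.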

- Frank

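It might even be faster than filter. I have not tested this, but see here. - tjebo

1

Unlike filter, slice can return the same row multiple times, e.g. mtcars[1, ] %>% slice(c(1, n())), so in that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large (in which case slice might be favored), but I haven't tested that either. - Frank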
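
The difference described in this comment shows up whenever a group has a single row. A minimal sketch, assuming dplyr is installed; the names one_row, sliced, filtered and deduped are made up for illustration:

```r
library(dplyr)

# A one-row data frame, so that positions 1 and n() refer to the same row
one_row <- mtcars[1, ]

# slice() returns one row per index it is given, so the row comes back twice
sliced <- one_row %>% slice(c(1, n()))

# filter() keeps each row at most once, so the row comes back once
filtered <- one_row %>% filter(row_number() == 1 | row_number() == n())

# Wrapping the indices in unique() makes slice() behave like filter() here
deduped <- one_row %>% slice(unique(c(1, n())))

c(sliced = nrow(sliced), filtered = nrow(filtered), deduped = nrow(deduped))
```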

18

Not dplyr, but it's much more direct using data.table:

library(data.table)
setDT(df)
df[
  df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id]$rows
]
    #    id stopId stopSequence
    # 1:  1      a            1
    # 2:  1      c            3
    # 3:  2      b            1
    # 4:  2      c            4
    # 5:  3      b            1
    # 6:  3      a            3

A more detailed explanation:

# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, take the row numbers of the first/last
#      observations for each id; the result of this operation
#      is itself a data.table
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the resulting
#      column rows to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(rows = .I[c(1L,.N)]), by=id]
idx = first_last$rows

# 2) extract rows by number
df[idx]

Be sure to check out the Getting Started wiki to get the data.table basics covered.

- MichaelChirico

1

Or df[df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1]. Seeing id appear twice is weird to me. - Frank

3

@ArtemKlevtsov - you might not always want to set keys, though. - SymbolixAU

2

Or df[order(stopSequence), .SD[c(1L,.N)], by = id]. See here. - JWilliman

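@JWilliman that won't necessarily be exactly the same, since it won't re-order on id. I think df[order(stopSequence), .SD[c(1L, .N)], keyby = id] should do the trick (with the minor difference from the solution above that the result will be keyed). - MichaelChirico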
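
The by versus keyby difference mentioned in this comment does not show up with the question's data, because id 1 happens to hold the first row after sorting. A small sketch with hypothetical data (dt, res_by and res_keyby are made-up names), assuming data.table is installed:

```r
library(data.table)

# Hypothetical data in which the smallest stopSequence belongs to id 2
dt <- data.table(id = c(1, 1, 2, 2), stopSequence = c(2, 3, 1, 4))

# by = keeps groups in order of first appearance after sorting: id 2 comes first
res_by <- dt[order(stopSequence), .SD[c(1L, .N)], by = id]

# keyby = additionally sorts the result by id and sets id as its key
res_keyby <- dt[order(stopSequence), .SD[c(1L, .N)], keyby = id]

res_by$id       # 2 2 1 1
res_keyby$id    # 1 1 2 2
key(res_keyby)  # "id"
```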

@keweik no worries :) - MichaelChirico

10

Using which.min and which.max:

library(dplyr, warn.conflicts = F)
df %>% 
  group_by(id) %>% 
  slice(c(which.min(stopSequence), which.max(stopSequence)))

#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3

Benchmark

It is also much faster than the currently accepted answer, because we find the min and max value by group instead of sorting the whole stopSequence column.

# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, F)) 
bench::mark(
  mm =df2 %>% 
    group_by(id) %>% 
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0
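
As a quick sanity check, the sort-based and the which.min/which.max approaches can be compared on the question's data. This is only a sketch, assuming dplyr is installed; by_sort and by_minmax are made-up names. Note that which.min() and which.max() return the position of the first extreme value, so the two approaches can pick different rows when the minimum or maximum is tied within a group.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Sort first, then take positions 1 and n() within each group
by_sort <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%
  ungroup()

# No sorting: look up the positions of the extremes within each group
by_minmax <- df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence))) %>%
  ungroup()

# Both approaches select the same rows for this (tie-free) data
all.equal(as.data.frame(by_sort), as.data.frame(by_minmax))

# Tie behaviour: the first of the two 3s is reported
which.max(c(3, 1, 3))  # 1
```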

- moodymudskipper

9

Something like:

library(dplyr)

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup

## Source: local data frame [6 x 3]
## 
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3

With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is far more appropriate for just this task.
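
In recent dplyr releases do() is marked as superseded. A roughly equivalent sketch with group_modify() follows, assuming a dplyr version that provides that function; it reuses the first_last() helper from this answer, and res is a made-up name.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Same helper as above: stack the first and the last row of a data frame
first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

res <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  # .x is the data for one group (without the grouping column)
  group_modify(~ first_last(.x)) %>%
  ungroup()

res
```

As with do, the function passed to group_modify() can do arbitrary work on each group, as long as it returns a data frame.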

- hrbrmstr

1

I hadn't considered writing a function. It's certainly a good way of handling something more complex. - tospig

1

This seems overcomplicated compared to just using slice, like df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n())). - Frank

5

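I don't disagree (and I pointed to jeremycg's as the better answer in the post), but having a do example here might help others when slice won't work (i.e. more complex operations on a group). And you should post your comment as an answer (it's the best one). - hrbrmstr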

9

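I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go using other packages too:

Base package: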

df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ], 
      df[!duplicated(df$id, fromLast = TRUE), ], 
      all = TRUE)

data.table:

df <-  setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

sqldf:

library(sqldf)
min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
      FROM df GROUP BY id 
      ORDER BY id, StopSequence, stopId")
sqldf("SELECT * FROM min
      UNION
      SELECT * FROM max")

In one query:

sqldf("SELECT * 
        FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)
        UNION
        SELECT *
        FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
              FROM df GROUP BY id 
              ORDER BY id, StopSequence, stopId)")

Output:

  id stopId StopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1

- mpalanco

5

This works well:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  slice(1,n())

# A tibble: 6 × 3
# Groups:   id [3]
#     id stopId stopSequence
#  <dbl> <chr>         <dbl>
#1     1 a                 1
#2     1 c                 3
#3     2 b                 1
#4     2 c                 4
#5     3 b                 1
#6     3 a                 3

- Vanessa

3

Using data.table:

# convert to data.table
setDT(df) 
# order, group, filter
df[order(stopSequence)][, .SD[c(1, .N)], by = id]

   id stopId stopSequence
1:  1      a            1
2:  1      c            3
3:  2      b            1
4:  2      c            4
5:  3      b            1
6:  3      a            3

- s_baldur

1

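A different base R alternative is to first order by id and stopSequence, then split the resulting indices by id; for every id we select only the first and last index and subset the data frame using those indices.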

df[sapply(with(df, split(order(id, stopSequence), id)), function(x) 
                   c(x[1], x[length(x)])), ]


#  id stopId stopSequence
#1  1      a            1
#3  1      c            3
#5  2      b            1
#6  2      c            4
#8  3      b            1
#7  3      a            3

Or similarly, using by:

df[unlist(with(df, by(order(id, stopSequence), id, function(x) 
                   c(x[1], x[length(x)])))), ]
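
One caveat: splitting the output of order() by the original id column lines up correctly only when the rows of df are already sorted by id, as they are in the question's data. A sketch of a base R variant that does not depend on the input order is shown below; shuffled, sorted, keep and result are made-up names, and the row shuffle is only there to simulate interleaved ids.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2))

# Hypothetical input whose ids are interleaved rather than contiguous
shuffled <- df[c(9, 1, 5, 3, 7, 2, 8, 4, 6), ]

# Sort by id, then by stopSequence within each id
sorted <- shuffled[order(shuffled$id, shuffled$stopSequence), ]

# A row is kept if it is the first or the last occurrence of its id
keep <- !duplicated(sorted$id) | !duplicated(sorted$id, fromLast = TRUE)

result <- sorted[keep, ]
result
```

The row names of result are those of the original df, so they show which rows were selected.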

- Ronak Shah

1

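Another approach with lapply and a dplyr statement. We can apply an arbitrary number of summary functions to the same statement:

lapply(c(first, last), 
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>% 
bind_rows()

You could, for example, also be interested in the maximum value within each group and do the following (each summary function is applied column by column, so the max rows need not correspond to a single row of df):

lapply(c(first, last, max), 
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>%
bind_rows()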

- Sahir Moosvi

- jeremycg · Accepted Answer

There is probably a faster way:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())