dplyr: get the maximum value in each group, excluding the value in each row?

Question

9

I have a data frame that looks like this:

> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]

  g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4

I know how to add a column holding the maximum x value for each group g:

> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max
1 A 7     7
2 A 3     7
3 B 5     9
4 B 9     9
5 B 2     9
6 C 4     4

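But what I want is, for each group g, the maximum x value among the other rows of the group, that is, excluding the x value of the row itself.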

For the given example, the desired output is:

Source: local data frame [6 x 3]
Groups: g

  g x x_max x_max_exclude
1 A 7     7             3
2 A 3     7             7
3 B 5     9             9
4 B 9     9             5
5 B 2     9             9
6 C 4     4            NA

I tried using row_number() to drop the current element and take the max of the rest, but it produced warnings and the incorrect output -Inf:

> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g

  g x x_max r x_max_exclude
1 A 7     7 1          -Inf
2 A 3     7 2          -Inf
3 B 5     9 1          -Inf
4 B 9     9 2          -Inf
5 B 2     9 3          -Inf
6 C 4     4 1          -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
  no non-missing arguments to max; returning -Inf

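What is the most readable, concise, and efficient way to get this output in dplyr? Any insight into why my row_number() attempt fails would also be much appreciated. Thanks for your help.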

- Eric

Is this the code you are after: summarise(group_by(df,g),max.x=max(x))? - Shenglin Chen

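Thanks, @Shenglin Chen, but that doesn't match the desired output in the example above. It would give me the maximum "x" value for each group (returning a data_frame with 3 rows). What I want is a data_frame with the same number of rows as the input table, where the value in row "r" is the maximum "x" value in group "g" excluding row "r". See the "desired output" above for a concrete example. - Eric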

4 Answers

4

Interesting problem. Here's one way using data.table:

require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]

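The idea is to order by column x and then, on those ordered indices, group by g. Because the indices are ordered, within each group the maximum for the first .N-1 rows is the value at row .N, and for row .N it is the value at row .N-1. .N is a special variable that holds the number of observations in each group.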

I'll leave it to you and/or the dplyr experts to translate this into dplyr (or to answer with a different approach).
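For readers without data.table, the sort-based idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the data.table code itself: `max_excl_sorted` is a hypothetical helper name, and the explicit `NA` for single-row groups is a choice made here to match the desired output in the question.

```r
# Hypothetical helper: for one group's values, return for each element
# the maximum of the *other* elements, using a single ordering pass.
max_excl_sorted <- function(x) {
  n <- length(x)
  if (n < 2L) return(rep(NA_real_, n))  # nothing is left once the only value is excluded
  o <- order(x)                         # positions of x in ascending order
  res <- numeric(n)
  res[o[-n]] <- x[o[n]]                 # every row except the top one sees the group maximum
  res[o[n]]  <- x[o[n - 1L]]            # the top row sees the runner-up
  res
}

df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4))

# ave() applies the helper within each level of g and keeps the original row order
df$x_max_exclude <- ave(df$x, df$g, FUN = max_excl_sorted)
df
```

Because the result is written back through the ordering indices, the rows stay in their original order and no separate re-sort is needed.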

- Arun

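Thanks for the data.table version, @Arun. I think it's similar in spirit to my current best dplyr solution (which I just posted), though I don't know data.table well enough to tell whether the two are exactly equivalent. - Eric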

2

Eric, the idea is similar, but the implementation differs. You call sort() for each group, and then there's ifelse() as well... - Arun

2

This is the best I've come up with so far. Not sure if there's a better way.

df %>% 
  group_by(g) %>% 
  mutate(x_max = max(x), 
         x_max2 = sort(x, decreasing = TRUE)[2], 
         x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
  select(-x_max2)

- Eric

You could simplify this to: group_by(df, g) %>% mutate(max = ifelse(x == max(x), sort(x, decreasing = TRUE)[2], max(x))). - Steven Beaupré

1

Another approach, using a helper function:

df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x))
Source: local data frame [6 x 3]
Groups: g

  g x x_max_exclude
1 A 7             3
2 A 3             7
3 B 5             9
4 B 9             5
5 B 2             9
6 C 4            NA

Here max_exclude is a function we wrote that performs the operation you describe:

max_exclude <- function(v) {
  res <- c()
  for(i in seq_along(v)) {
    res[i] <- suppressWarnings(max(v[-i]))
  }
  res <- ifelse(!is.finite(res), NA, res)
  as.numeric(res)
}

It also works in base R:

df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude))
Source: local data frame [6 x 3]

  g x x_max_exclude
1 A 7             3
2 A 3             7
3 B 5             9
4 B 9             5
5 B 2             9
6 C 4            NA
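The per-element loop in max_exclude is also what the row_number() attempt in the question was missing. Inside a grouped mutate(), an expression sees the whole group's vector at once, so `r` is the full sequence of positions and `x[-r]` drops every element rather than one per row. A minimal base R sketch for group B (the variable names are only illustrative):

```r
x <- c(5, 9, 2)        # the x values of group B
r <- seq_along(x)      # what row_number() evaluates to within the group: 1 2 3

# Negative indexing with all positions removes everything at once
dropped_all <- x[-r]   # numeric(0)

# max() of an empty vector is -Inf (with a warning), as seen in the question
empty_max <- suppressWarnings(max(dropped_all))

# Excluding one position at a time gives the intended result
per_row <- sapply(r, function(i) max(x[-i]))
per_row
```

The same -Inf is then recycled to every row of the group, which matches the output shown in the question.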

Benchmark

Kids, let this be a lesson: beware of for loops!

big.df <- data.frame(g=rep(LETTERS[1:4], each=1e3), x=sample(10, 4e3, replace=T))


microbenchmark(
  plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)),
  plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN=max_exclude)),
  StevenB = (big.df %>% 
    group_by(g) %>% 
    mutate(max = ifelse(row_number(desc(x)) == 1, x[row_number(desc(x)) == 2], max(x)))
    ),
  Eric = df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
           x_max2 = sort(x, decreasing = TRUE)[2], 
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2),
  Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
)

Unit: milliseconds
          expr       min        lq      mean    median        uq        max neval
 plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166   100
   plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294   100
       StevenB  4.436973  4.699226  5.207548  4.931484  5.364242  11.893306   100
          Eric  7.233057  8.034092  8.921904  8.414720  9.060488  15.946281   100
          Arun  1.789097  2.037235  2.410915  2.226988  2.423638   9.326272   100

- Pierre L

This seems rather expensive. Not sure it would scale to larger datasets. - Steven Beaupré

1

@StevenBeaupré That may well be. It's just another idea. - Pierre L

1

@StevenBeaupré I tested the speed. Embarrassingly slow. - Pierre L

1

It seems that in some of the benchmarks (Eric and Arun) you didn't use "big.df"? - talat

- Steven Beaupré · Accepted Answer

You could try:

df %>% 
  group_by(g) %>% 
  arrange(desc(x)) %>% 
  mutate(max = ifelse(x == max(x), x[2], max(x)))

Which gives:

#Source: local data frame [6 x 3]
#Groups: g
#
#  g x max
#1 A 7   3
#2 A 3   7
#3 B 9   5
#4 B 5   9
#5 B 2   9
#6 C 4  NA
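One caveat when reproducing this today: to my knowledge, recent dplyr versions make arrange() ignore grouping unless `.by_group = TRUE` is passed, so the row order of the printed result may differ from the output above even though the computed values are the same. A hedged sketch that asks for the within-group ordering explicitly (it assumes a dplyr version that provides `tibble()` and the `.by_group` argument):

```r
library(dplyr)

df <- tibble(g = c("A", "A", "B", "B", "B", "C"),
             x = c(7, 3, 5, 9, 2, 4))

res <- df %>%
  group_by(g) %>%
  # sort by x descending within each group, keeping groups together
  arrange(desc(x), .by_group = TRUE) %>%
  # the top row of a group gets the runner-up x[2]; every other row gets the group max
  mutate(max = ifelse(x == max(x), x[2], max(x))) %>%
  ungroup()

res
```

For the single-row group C, `x[2]` is NA, which yields the NA in the desired output.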

Benchmark

I benchmarked the solutions posted so far:

df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))

library(microbenchmark)

mbm <- microbenchmark(
  steven = df %>% 
    group_by(g) %>% 
    arrange(desc(x)) %>% 
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>% 
    group_by(g) %>% 
    mutate(x_max = max(x), 
           x_max2 = sort(x, decreasing = TRUE)[2], 
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>% 
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
  times = 50
)

@Arun's data.table solution is the fastest:

# Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval cld
#  steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50  b 
#    eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
#    arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50 a
