How do I compute order statistics by group in R?

4

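How can I compute an order statistic by group in R? I want to aggregate the results on one column and return only one row per group. That row should be the N-th element of the group under some ordering. Ideally this would use only base functions.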

x <- data.frame(Group=c("A","A", "A", "C", "C"), 
                Name=c("v", "u", "w", "x", "y"), 
                Quantity=c(3,3,4,2,0))
> x
  Group Name Quantity
1     A    v        3
2     A    u        3
3     A    w        4
4     C    x        2
5     C    y        0

I want the N-th highest row in each group, ordered by Quantity and then by Name. For N=2, the result is

  Group Name Quantity
1     A    u        3
5     C    y        0

For N=1
  Group Name Quantity
3     A    w        4
4     C    x        2

I tried the following, but got an error message that was not very helpful.
 aggregate.data.frame(x, list(x$Group), function(y){ max(y[,'Quantity'])})
 Error in `[.default`(y, , "Quantity") (from #1) : incorrect number of dimensions
4 Answers

2
x <- 
    data.frame(
        Group = c("A","A", "A", "C", "C", "A", "A") , 
        Name = c("v", "u", "w", "x", "y" ,"v", "u") , 
        Quantity = c(3,3,4,2,0,4,1)
    )

# sort your data to start..
# note that Quantity vs. Group and Name
# are sorted in different directions,
# so the -as.numeric() flips them
x <- 
    x[ 
        order( 
            -as.numeric( x$Group ) , 
            x$Quantity , 
            -as.numeric( x$Name ) , 
            decreasing = TRUE 
        ) , 
    ]
# once your data frame is sorted the way you want your Ns to occur, the rest is easy

# rank your data..  
# just create the numerical order, 
# but within each group..
# (or you could add those ranks directly to the data frame if you like)
ranks <- 
    unlist( 
        tapply( 
            order( x$Group ) , 
            as.numeric( x$Group ) , 
            order 
        ) 
    )

# N = 1
x[ ranks == 1 , ]

# N = 2
x[ ranks == 2 , ]

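I think your "N" and "ranks" should agree. x[ranks == 2, ]$Name returns c('v','y') instead of the desired c('u','y'). I initially fell into the same trap. - Matthew Lundberg
With the edit, you are taking the minimum Name in each group, which happens to be correct in the example because there is only one Name value at rank 1, but it is not correct in general. - Matthew Lundberg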
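Following up on the comments above, here is a minimal sketch of one way to make the within-group ranks line up with N. It is not the answerer's code: it swaps in `ave()` with `seq_along` for the `tapply()`/`order()` step, uses the question's original five-row data, and introduces a hypothetical helper `nth_row`. Note also that since R 4.0.0 `data.frame()` no longer converts strings to factors by default, so `as.numeric(x$Group)` on a character column would return `NA`; the sketch sorts the character columns directly.

```r
# The question's data; stringsAsFactors = FALSE keeps Group and Name as character
x <- data.frame(
    Group    = c("A", "A", "A", "C", "C"),
    Name     = c("v", "u", "w", "x", "y"),
    Quantity = c(3, 3, 4, 2, 0),
    stringsAsFactors = FALSE
)

# Sort: Group ascending, Quantity descending, Name ascending within ties
x <- x[order(x$Group, -x$Quantity, x$Name), ]

# Position of each row within its group, in the sorted order
ranks <- ave(seq_len(nrow(x)), x$Group, FUN = seq_along)

# Hypothetical helper: rows that sit at position n in their group
nth_row <- function(n) x[ranks == n, ]

nth_row(1)
nth_row(2)
```

Because the rank is simply the row's position after sorting, `ranks == N` selects the N-th row of each group, and groups with fewer than N rows contribute nothing.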

1

Some aggregate-merge magic:

f <- function(x, N) {
  sel <- function(x) {                                   # Choose the N-th highest value from the set, or the lowest element if there are fewer than N unique elements.  Is there a built-in for this?
    z <- unique(x)                                       # This assumes that you want the N-th highest unique value.  Simply don't filter by unique if not.
    z[order(z, decreasing=TRUE)][min(N, length(z))]
  }

  xNq <- aggregate(Quantity ~ Group, data=x,   sel)      # Choose the N-th highest quantity within each "Group"
  xNm <- merge(x, xNq)                                   # Add the matching "Name" values
  x <- aggregate(Name ~ Quantity + Group, data=xNm, sel) # Choose the N-th highest Name in each group
  x[c('Group', 'Name', 'Quantity')]                      # Put into original order
}


> f(x, 2)
##   Group Name Quantity
## 1     A    u        3
## 2     C    y        0

> f(x, 1)
##   Group Name Quantity
## 1     A    w        4
## 2     C    x        2

1
# define ordering function, increasing on Quantity, decreasing on Name
in.order <- function(group) with(group, group[order(Quantity, -rank(Name)), ])

# set desired rank for each Group
N <- 2

# get Nth row by Group, according to in.order
group.rows <- by(x, x$Group, function(group) head(tail(in.order(group), N), 1))

# collapse rows into data.frame
do.call(rbind, group.rows)

#   Group Name Quantity
# A     A    u        3
# C     C    y        0

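The reason you see the error from aggregate.data.frame is that the function applies FUN to each column within each subset defined by the by argument, not to each subset of the complete data.frame (which is what the by function used above does). With aggregate, whatever you supply as FUN should accept a column, not a data.frame. In your example, you tried to index the vector y as if it were a data.frame, hence the dimension error.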
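The difference described above can be checked directly. The sketch below uses the question's data; the variable names `seen`, `max_q`, and `got_df` are illustrative only.

```r
x <- data.frame(Group = c("A", "A", "A", "C", "C"),
                Name = c("v", "u", "w", "x", "y"),
                Quantity = c(3, 3, 4, 2, 0),
                stringsAsFactors = FALSE)

# aggregate() hands FUN one column of one group at a time: a plain vector,
# which has no dim attribute, so y[, 'Quantity'] cannot work
seen <- aggregate(x["Quantity"], by = list(Group = x$Group),
                  FUN = function(y) is.null(dim(y)))

# A function that accepts a vector works as expected
max_q <- aggregate(x["Quantity"], by = list(Group = x$Group), FUN = max)

# by() hands FUN the whole sub-data.frame for each group
got_df <- by(x, x$Group, function(g) is.data.frame(g))
```

Here `max_q` holds the per-group maximum of Quantity, which is what the call in the question was reaching for, but it cannot carry the matching Name along; that is why the row-wise `by()` approach suits this problem.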

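+1 This is the simplest solution! May I suggest adding an argument to the in.order function to control ascending or descending order. - agstudy
@agstudy That is a valid suggestion. If I were going to use this myself, I would definitely do that. For the sake of brevity, though, I am going to leave it as is. - Matthew Plourde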
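For readers who want the variant discussed in these comments, here is one possible sketch. The `decreasing` argument and the names `nth.high` and `nth.low` are assumptions added for illustration, not part of the answer; the data is the question's.

```r
x <- data.frame(Group = c("A", "A", "A", "C", "C"),
                Name = c("v", "u", "w", "x", "y"),
                Quantity = c(3, 3, 4, 2, 0),
                stringsAsFactors = FALSE)

# decreasing = TRUE: Quantity high to low, ties broken by Name a to z
# decreasing = FALSE: the exact reverse of that order
in.order <- function(group, decreasing = TRUE) {
  idx <- with(group, order(Quantity, -rank(Name), decreasing = decreasing))
  group[idx, ]
}

N <- 2

# N-th row counted from the top of each sorted group
nth.high <- do.call(rbind, by(x, x$Group, function(g)
  tail(head(in.order(g, decreasing = TRUE), N), 1)))

nth.low <- do.call(rbind, by(x, x$Group, function(g)
  tail(head(in.order(g, decreasing = FALSE), N), 1)))
```

With the direction moved into `in.order`, the selection always reads from the top of the sorted group. As with the original `head(tail(...))` idiom, a group with fewer than N rows falls back to its last available row rather than being dropped.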

0

I went with

do.call(rbind, by(x, x$Group, function(x)
      x[order(-x$Quantity, x$Name),][1,]))

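It builds on the others' suggestions. I found that this solution fits my thought process better, and I also learned a lot from the other posted solutions.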
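The snippet above returns the first row per group, that is, N=1. A minimal sketch of the same idea for an arbitrary N follows; the function name `nth_per_group` and the choice to skip groups with fewer than N rows are assumptions, not part of the original answer.

```r
x <- data.frame(Group = c("A", "A", "A", "C", "C"),
                Name = c("v", "u", "w", "x", "y"),
                Quantity = c(3, 3, 4, 2, 0),
                stringsAsFactors = FALSE)

# Hypothetical generalisation: N-th row per group, Quantity descending, Name ascending
nth_per_group <- function(df, N) {
  pieces <- lapply(split(df, df$Group), function(g) {
    g <- g[order(-g$Quantity, g$Name), ]
    if (nrow(g) >= N) g[N, ] else NULL   # groups that are too small are skipped
  })
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

nth_per_group(x, 1)
nth_per_group(x, 2)
```

Without the `nrow(g) >= N` check, indexing with `g[N, ]` on a short group would produce a row of `NA` values instead of an error, which is easy to miss.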

