如何按组计算百分位数?

5

我有一个data.table,其中有一万多行数据,它看起来像这样:

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

    ID result_2010 result_2011 years
 1:  1        TRUE       FALSE  15.0
 2:  2       FALSE        TRUE  16.5
 3:  3        TRUE       FALSE  31.0
 4:  4       FALSE       FALSE   1.0
 5:  5       FALSE       FALSE  40.2
 6:  6        TRUE       FALSE   0.3
 7:  7       FALSE        TRUE  12.0
 8:  8       FALSE       FALSE  22.7
 9:  9        TRUE       FALSE  19.0
10: 10       FALSE        TRUE  12.0

对于 "result_2010" 和 "result_2011",我想针对 "years" 进行百分位数分析,但仅在个人的值为 "TRUE" 时进行。 我尝试的代码似乎有效,但它返回了相同的结果,这肯定是不正确的:
DT1 %>%
  group_by(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  group_by(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

有人能帮忙纠正我的代码吗?


1
你可能想使用 filter 而不是 group_by,即 filter(result_2010 == "TRUE") - Ronak Shah
1
你使用"TRUE"/"FALSE"而不是更直接的TRUE/FALSE有特别的原因吗?我发现高效的处理通常始于高效的数据。 - r2evans
任何一个(全部?)答案是否解决了你的问题,Gabesz? - r2evans
我们似乎用过多的解决方案和它们的复杂性来压倒提问者。Gabesh可能害怕尝试所有这些。更不用说决定给谁15个声望点了。而他自己因为他的问题获得了+30分 :-(!PS.我再次检查了你的解决方案,当DT1中的变量result_2010result_2011logicalcharacter类型时,每次都会出现错误“Error ... object 'value' not found”。 - Marek Fiołka
5个回答

4

使用 meltaggregate

library(data.table)
melt(DT1, c(1, 4), 2:3) |>
  transform(variable=substring(variable, 8)) |>
  subset(value == TRUE) |>
  with(aggregate(list(q=years), list(year=variable), \(x)
                 c(quantile(x), mean=mean(x))))
#   year   q.0%  q.25%  q.50%  q.75% q.100% q.mean
# 1 2010  0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500

注意:请使用R>=4.1版本来使用 |>管道符以及\(x)函数的缩写形式(或者写成function(x))。


1
我特别喜欢这个程序中类似于dplyr的管道流程,但没有使用dplyr。可惜(在我看来),按组进行的transform(虽然此处未使用,但通常情况下)似乎不太流畅(即需要使用ave)。 - r2evans
@r2evans,你有没有检查过ave代码,它在内部隐藏了lapply(split()) - jay.sf
1
是的,过去我看过它,使用\split<-`非常有启发性。一般来说,dplyr的group_by(grp) %>% mutate(a = ...)似乎不太适合翻译成transform(a = ave(a, grp, FUN = (x) ...))`,而且在同时转换多个变量时效果更差。 - r2evans

4
library(tidyverse)
DT1 <- tibble(ID = 1:10,
                  result_2010 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
                  result_2011 = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

tibble(
  year = c(2010, 2011),
  data = list(DT1$years[DT1$result_2010],
              DT1$years[DT1$result_2011])
) %>% group_by(year) %>% 
  group_modify(~fQuantMean(.x$data[[1]]))

输出

# A tibble: 2 x 7
# Groups:   year [2]
   year  `0%` `25%` `50%` `75%` `100%`  mean
  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
1  2010   0.3  11.3    17  22     31    16.3
2  2011  12    12      12  14.2   16.5  13.5

对于任何有兴趣的人的更新!

亲爱的同事们,你们可以看到,每个任务都可以用几种不同的方式解决。因此,我决定比较这里提出的方法。由于 @Gabesz 提到他有10000个观察结果,我决定从性能方面检查每个解决方案。

n=10000
set.seed(1234)
DT1 <- tibble(ID = 1:n,
              result_2010 = sample(c(TRUE, FALSE), n, replace = TRUE),
              result_2011 = sample(c(TRUE, FALSE), n, replace = TRUE),
              years = rnorm(n, 20, 5))

然后我进行了一些基准测试

fQuantMean = function(x) t(quantile(x)) %>% 
  as_tibble() %>% bind_cols(mean = mean(x))

fFiolka = function(){
  tibble(
    year = c(2010, 2011),
    data = list(DT1$years[DT1$result_2010],
                DT1$years[DT1$result_2011])
  ) %>% group_by(year) %>% 
    group_modify(~fQuantMean(.x$data[[1]]))
}
fFiolka()
# # A tibble: 2 x 7
# # Groups:   year [2]
#    year     `0%` `25%` `50%` `75%` `100%`  mean
#    <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1  2010 -0.00697  16.4  19.9  23.3   38.6  19.9
# 2  2011 -0.633    16.5  20.0  23.4   38.6  20.0

library(data.table)

fjay_sf = function(){
  melt(DT1, c(1, 4), 2:3) |>
    transform(variable=substring(variable, 8)) |>
    subset(value == TRUE) |>
    with(aggregate(list(q=years), list(year=variable), \(x)
                   c(quantile(x), mean=mean(x))))
}
fjay_sf()
# year         q.0%        q.25%        q.50%        q.75%       q.100%       q.mean
# 1 2010 -0.006968224 16.447077579 19.947385976 23.348571278 38.636456902 19.944574420
# 2 2011 -0.633138113 16.530534403 20.043636844 23.424378551 38.636456902 20.013130400
# Warning message:
#   In melt(DT1, c(1, 4), 2:3) :
#   The melt generic in data.table has been passed a tbl_df and will attempt to redirect 
#   to the relevant reshape2 method; please note that reshape2 is deprecated, and this 
#   redirection is now deprecated as well. To continue using melt methods from reshape2
#    while both libraries are attached, e.g. melt.list, you can prepend the namespace 
#    like reshape2::melt(DT1). In the next version, this warning will become an error.


cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

fShah = function(){
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)
}
fShah()
# # A tibble: 2 x 8
#   Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1 result_2010  13.5  16.4  19.9  23.3  26.4   19.9  19.9
# 2 result_2011  13.4  16.5  20    23.4  26.6   20    20  

library(microbenchmark)
ggplot2::autoplot(microbenchmark(fFiolka(), fjay_sf(), fShah(), times=100))

enter image description here

希望上面的图表已经解释清楚了。
@r2evans,请不要责怪我跳过了您的解决方案,因为它导致了一些错误。

2
您可以编写一个函数并在每个result列上运行它。
library(tidyverse)

cols <- grep('result_', names(DT1), value = TRUE)

get_stats_fun <- function(DT, col) {
  DT %>%
    filter(.data[[col]] == "TRUE") %>%
    summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
              "median" = round(median(years), 1),
              "Mean" = round(mean(years),1)) %>%
    unnest_wider(quantile)
}

map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
  mutate(Year = cols)

#  Year        `10%` `25%` `50%` `75%` `90%` median  Mean
#  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
#1 result_2010   4.7  11.3    17  22    27.4     17  16.3
#2 result_2011  12    12      12  14.2  15.6     12  13.5

2
一个 melt/dcast 选项:
library(data.table)
tmp <- melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, .(variable = c(paste0("q", c(10, 25, 50, 75, 90)), "mu"),
         value = c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                  mean(years)))
    , by = .(year)]
tmp
#            year variable  value
#          <fctr>   <char>  <num>
#  1: result_2010      q10  4.710
#  2: result_2010      q25 11.325
#  3: result_2010      q50 17.000
#  4: result_2010      q75 22.000
#  5: result_2010      q90 27.400
#  6: result_2010       mu 16.325
#  7: result_2011      q10 12.000
#  8: result_2011      q25 12.000
#  9: result_2011      q50 12.000
# 10: result_2011      q75 14.250
# 11: result_2011      q90 15.600
# 12: result_2011       mu 13.500

dcast(tmp, year ~ variable, value.var = "value")
#           year     mu   q10    q25   q50   q75   q90
#         <fctr>  <num> <num>  <num> <num> <num> <num>
# 1: result_2010 16.325  4.71 11.325    17 22.00  27.4
# 2: result_2011 13.500 12.00 12.000    12 14.25  15.6

您可以完全控制名称,只需在"variable"列中按顺序分配它们(您可能会选择更好的命名)。

或者一个孤立的melt

melt(DT1, c("ID", "years"), variable.name = "year"
  )[ value == "TRUE",
   ][, setNames(as.list(c(quantile(years, c(0.1, 0.25, 0.5, 0.75, 0.9)), 
                          mean(years))),
                c(paste0("q", c(10, 25, 50, 75, 90)), "mu"))
    , by = .(year)][]
#           year   q10    q25   q50   q75   q90     mu
#         <fctr> <num>  <num> <num> <num> <num>  <num>
# 1: result_2010  4.71 11.325    17 22.00  27.4 16.325
# 2: result_2011 12.00 12.000    12 14.25  15.6 13.500

现在可以轻松地控制名称,只需在setNames的第二个参数中进行设置。前提是,在data.table处理中返回一个命名的list将把它转换为命名列,因此任何执行此操作的函数都可以轻松使用。


1

这将是我的第一个答案,如果我做错了什么,请原谅。仔细阅读您的问题后,您希望有人帮助您改进代码。请看这里。

library(tidyverse)
library(data.table)

DT1 <- data.table(ID = 1:10,
                  result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
                  result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
                  years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
DT1 %>%
  filter(result_2010 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))
DT1 %>%
  filter(result_2011 == "TRUE") %>%
  summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
            "25.quantile"= round(quantile(years,c(.25)),digits=1),
            "Median"= round(quantile(years,c(.50)),digits=1),
            "75.quantile"= round(quantile(years,c(.75)),digits=1),
            "90.quantile"= round(quantile(years,c(.90)),digits=1),
            "Mean" = round(mean(years),digits=1))

在第一种情况下,它返回值4.7、11.3、17、22、27.4、16.3。在第二种情况下,它返回12、12、12、14.2、15.6、13.5。我看到这里有很多不同的答案。虽然我诚实地承认有些我还不理解。我真的喜欢使用quantile%>% tibble%>% bind_cols的解决方案。但是,我因为指向这个答案有帮助而声望低下,这让我有点担心。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接