Weighted percentile calculation in a data frame using dplyr

Question

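I am trying to compute the percentile rank of a value in a data frame, and the data frame also has an associated frequency column that I want to use as a weight. I am struggling to come up with a solution that calculates the percentile of the original value as if the overall distribution consisted of that value repeated freq times, together with every other value repeated its own freq times.

For example: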

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 20,
  "banana",   2, 5,
  "carrot",   3, 1
)

groceries %>% 
    mutate(reg_ptile = percent_rank(price),
           wtd_ptile = weighted_percent_rank(price, wt = freq))

# the expected result would be:

# A tibble: 3 x 5
  item   price  freq reg_ptile wtd_ptile
  <chr>  <dbl> <dbl> <dbl>     <dbl>
1 apple      1    20  0.0      0.0
2 banana     2     5  0.5      0.8
3 carrot     3     1  1.0      1.0

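percent_rank() is an actual dplyr function. How can I write the function weighted_percent_rank()? I am not sure how to make it work with a data frame and a pipe. It would be great if the solution also worked with groups.

EDIT: Using uncount() does not really work, because uncounting the data I am using would produce 800 billion rows. Any other ideas?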

- Adhi R.

1 Answer

- Allan Cameron · Accepted Answer

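You can use tidyr::uncount to expand the rows according to their frequency and get the weighted percentile, then collapse them back with summarize, as in the following reprex: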

library(dplyr)

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 10,
  "banana",   2, 5,
  "carrot",   3, 1
)

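groceries %>% 
  tidyr::uncount(freq) %>%                      # one row per unit of frequency
  mutate(wtd_ptile = percent_rank(price)) %>%   # rank within the expanded data
  group_by(item) %>%
  summarize_all(~.[1]) %>%                      # collapse back to one row per item
  mutate(ptile = percent_rank(price))           # unweighted rank, for comparison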
#> # A tibble: 3 x 4
#>   item   price wtd_ptile ptile
#>   <chr>  <dbl>     <dbl> <dbl>
#> 1 apple      1     0       0  
#> 2 banana     2     0.667   0.5
#> 3 carrot     3     1       1

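Note that you could choose a different ranking function, but with this data (apple has freq 10 here, not 20 as in the question) the weighted percentile for banana is 0.667 (10/(16 - 1)) rather than 0.8.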
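The difference from the 0.8 in the question appears to come only from the data: the question gives apple a freq of 20, while the reprex above uses 10. A small check of the arithmetic, assuming the weighted rank of a value is the total weight strictly below it divided by (total weight - 1); banana_rank() is a hypothetical helper written just for this check:

```r
# Frequencies as used in this answer (apple = 10) and in the question (apple = 20)
freq_answer   <- c(apple = 10, banana = 5, carrot = 1)
freq_question <- c(apple = 20, banana = 5, carrot = 1)

# Hypothetical helper: only apple is priced below banana, so the weight
# strictly below banana is apple's frequency
banana_rank <- function(freq) unname(freq["apple"] / (sum(freq) - 1))

banana_rank(freq_answer)    # 10 / 15
banana_rank(freq_question)  # 20 / 25
```

Under that assumption, the same formula gives 0.667 for the data in this answer and 0.8 for the data in the question.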

EDIT: An alternative that does not involve creating billions of rows:

groceries %>% 
  arrange(price) %>% 
  # cumulative freq of the rows before this one, divided by (total freq - 1)
  mutate(wtd_ptile = lag(cumsum(freq), default = 0)/(sum(freq) - 1))
#> # A tibble: 3 x 4
#>   item   price  freq wtd_ptile
#>   <chr>  <dbl> <dbl>     <dbl>
#> 1 apple      1    10     0    
#> 2 banana     2     5     0.667
#> 3 carrot     3     1     1
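To get the weighted_percent_rank() the question asks for, the same idea could be wrapped in a function. The following is a minimal sketch, not part of dplyr: it assumes wt holds non-negative frequencies, that neither argument contains NA, and that tied values should share the lowest rank, as percent_rank() does. Unlike the pipeline above, it does not need the rows to be sorted first.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): weighted analogue of percent_rank().
# Assumes no NAs and non-negative frequencies in wt.
weighted_percent_rank <- function(x, wt) {
  values <- sort(unique(x))                   # distinct values, ascending
  idx <- match(x, values)                     # position of each x among them
  wt_by_value <- as.numeric(rowsum(wt, idx))  # total weight per distinct value
  below <- cumsum(wt_by_value) - wt_by_value  # weight strictly below each value
  below[idx] / (sum(wt) - 1)
}

# Data from the question (apple has freq 20)
groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 20,
  "banana",  2, 5,
  "carrot",  3, 1
)

groceries %>%
  mutate(reg_ptile = percent_rank(price),
         wtd_ptile = weighted_percent_rank(price, wt = freq))
```

Because it is an ordinary vectorized function, it should also work per group if group_by() is called before mutate().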