How can I efficiently calculate the median of a numeric sequence in Google BigQuery?

13

I need to efficiently calculate the median of a sequence of numeric values in Google BigQuery. Is this possible?


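Your accept rate is quite low. On Stack Overflow, you mark an answer as accepted by clicking the check mark to the left of the answer, below the vote count. This will raise your accept rate. See this link for details: http://meta.stackoverflow.com/questions/5234/how-does-accepting-an-answer-work#5235 - Pentium10
See also: https://dev59.com/563la4cB1Zd3GeqPTdsE - Felipe Hoffa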
3 Answers

16

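You can do this with the PERCENTILE_CONT window function.

It sorts the values in the group according to the ORDER BY clause and returns a value computed by linear interpolation between them.

The percentile argument must be between 0 and 1.

This window function requires ORDER BY in the OVER clause (this is BigQuery legacy SQL syntax).

So an example query could look like this (max() is only there so the query can aggregate across the GROUP BY; it plays no part in the math, so don't let it confuse you):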

SELECT room,
       max(median)
FROM
  (SELECT room,
          percentile_cont(0.5) OVER (PARTITION BY room
                                     ORDER BY temperature) AS median
   FROM
     (SELECT 1 AS room, 11 AS temperature),
     (SELECT 1 AS room, 12 AS temperature),
     (SELECT 1 AS room, 14 AS temperature),
     (SELECT 1 AS room, 19 AS temperature),
     (SELECT 1 AS room, 13 AS temperature),
     (SELECT 2 AS room, 20 AS temperature),
     (SELECT 2 AS room, 21 AS temperature),
     (SELECT 2 AS room, 29 AS temperature),
     (SELECT 3 AS room, 30 AS temperature))
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
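
As a rough illustration of the linear interpolation described above, the following Python sketch computes the same kind of value outside BigQuery. It assumes the usual SQL definition of a continuous percentile (a fractional position of fraction * (n - 1) in the sorted values). The function name percentile_cont_sketch is made up for this example and is not a BigQuery API.

```python
def percentile_cont_sketch(values, fraction):
    """Continuous percentile by linear interpolation (illustrative only)."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)  # the median is defined on the sorted values
    if not ordered:
        raise ValueError("values must not be empty")
    position = fraction * (len(ordered) - 1)  # fractional index into the sorted list
    lower = int(position)                     # floor, since position >= 0
    upper = min(lower + 1, len(ordered) - 1)  # neighbor used for interpolation
    weight = position - lower                 # how far we are between the two
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Same sample data as the query above, keyed by room.
rooms = {1: [11, 12, 14, 19, 13], 2: [20, 21, 29], 3: [30]}
medians = {room: percentile_cont_sketch(temps, 0.5) for room, temps in rooms.items()}
print(medians)
```

For the sample rooms this gives 13, 21 and 30, matching the table above. With an even number of values, such as 1, 2, 3, 4, the sketch interpolates between the two middle values and returns 2.5.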

2
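Could you please provide a clearer, more concise query? I can't follow the one above. - Manish Agrawal
@ManishAgrawal Try running it piece by piece and you will get there; the query is simple. What may be new to you is OVER(), which is the basis of window functions and worth reading up on. If the FROM clause confuses you: I emulated a table there so that you can copy, paste, and run this query as is. - Pentium10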
1
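Room 1 has the values 11, 12, 14, 19, 13. Shouldn't the median be 14? - Andres Urrego Angel
@AndresUrregoAngel The median is derived from an ordered series of values. In the order you listed them, the value "14" does sit in the middle, but that is not an ordered list. The ordered list is 11, 12, 13, 14, 19, so "13" is the correct median. - Fab Dot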

7

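When you don't need an exact result and an approximation is good enough, you can use a combination of the NTH and QUANTILES aggregate functions as an alternative. The advantage of this approach is that it scales better than analytic window functions; the drawback is that it only gives approximate results.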

SELECT room,
       NTH(50, QUANTILES(temperature, 101))
FROM
  (SELECT 1 AS room, 11 AS temperature),
  (SELECT 1 AS room, 12 AS temperature),
  (SELECT 1 AS room, 14 AS temperature),
  (SELECT 1 AS room, 19 AS temperature),
  (SELECT 1 AS room, 13 AS temperature),
  (SELECT 2 AS room, 20 AS temperature),
  (SELECT 2 AS room, 21 AS temperature),
  (SELECT 2 AS room, 29 AS temperature),
  (SELECT 3 AS room, 30 AS temperature)
GROUP BY room

This returns:

room temperature 
1    13  
2    21  
3    30

I think you need NTH(51, QUANTILES(temperature, 101)) to get the median, because NTH is 1-based. See https://cloud.google.com/bigquery/query-reference#quantiles - Richard Poole
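
To see why the index matters, here is a small Python sketch of the indexing only. It assumes that QUANTILES(expr, 101) yields 101 boundaries running from the minimum (0%) to the maximum (100%), and it uses exact nearest-rank boundaries in place of BigQuery's approximation. The helper names quantile_boundaries and nth are hypothetical.

```python
def quantile_boundaries(values, count):
    """Exact stand-in for `count` quantile boundaries, minimum first, maximum last."""
    ordered = sorted(values)
    last = len(ordered) - 1
    steps = count - 1
    # Round (i / steps) * last to the nearest index using integer arithmetic.
    return [ordered[(2 * i * last + steps) // (2 * steps)] for i in range(count)]


def nth(position, items):
    """1-based lookup, mirroring how NTH counts."""
    return items[position - 1]


large = quantile_boundaries(range(201), 101)             # values 0..200, true median 100
small = quantile_boundaries([11, 12, 14, 19, 13], 101)   # room 1 from the answer

print(nth(50, large), nth(51, large))  # 49% boundary vs. 50% boundary
print(nth(50, small), nth(51, small))  # identical on a tiny sample
```

On the 0..200 data, entry 50 is 98 while entry 51 is the true median 100. On the five values of room 1, both entries are 13, which may be why NTH(50, ...) still looks correct on such a small sample.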

7

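2018 update, with more metrics:

BigQuery SQL: average, geometric mean, removing outliers, median

For my own reference, here are working queries against the taxi data:

Approximate quantiles: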

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

The same results with PERCENTILE_DISC:
SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
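
PERCENTILE_DISC is the discrete counterpart of PERCENTILE_CONT: it returns a value that actually occurs in the data rather than an interpolated one. The Python sketch below assumes the common SQL definition (the first sorted value whose cumulative distribution is at least the requested fraction). BigQuery's legacy implementation may pick the value slightly differently, and the function name percentile_disc_sketch is hypothetical.

```python
def percentile_disc_sketch(values, fraction):
    """Discrete percentile: always returns one of the input values (illustrative only)."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    count = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        # rank / count is the cumulative distribution of this sorted value.
        if rank / count >= fraction:
            return value
    return ordered[-1]  # not reached for a valid fraction; kept as a safe fallback


odd_sample = [20, 21, 29]            # odd count: the middle value
even_sample = [1.0, 2.0, 3.0, 4.0]   # even count: the lower of the two middle values

print(percentile_disc_sketch(odd_sample, 0.5))
print(percentile_disc_sketch(even_sample, 0.5))
```

Under this definition the two functions agree whenever the number of values is odd. With an even count, the discrete version returns the lower middle value, whereas linear interpolation would return the midpoint of the two middle values.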

Standard SQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
