I need to efficiently calculate the median of a series of numbers in Google BigQuery. Is this possible?
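You can do this with the PERCENTILE_CONT window function.
It sorts the values in each group according to the ORDER BY clause, linearly interpolates between them, and returns the resulting value.
The percentile argument must be between 0 and 1.
This window function requires ORDER BY in the OVER clause.
So an example query might look like this (don't let max() confuse you: it is only there so the query works across the GROUP BY, not as part of the math):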
SELECT room, max(median)
FROM (
  SELECT room,
         percentile_cont(0.5) OVER (PARTITION BY room ORDER BY temperature) AS median
  FROM (SELECT 1 AS room, 11 AS temperature),
       (SELECT 1 AS room, 12 AS temperature),
       (SELECT 1 AS room, 14 AS temperature),
       (SELECT 1 AS room, 19 AS temperature),
       (SELECT 1 AS room, 13 AS temperature),
       (SELECT 2 AS room, 20 AS temperature),
       (SELECT 2 AS room, 21 AS temperature),
       (SELECT 2 AS room, 29 AS temperature),
       (SELECT 3 AS room, 30 AS temperature)
)
GROUP BY room
This returns:
+------+-------------+
| room | temperature |
+------+-------------+
| 1 | 13 |
| 2 | 21 |
| 3 | 30 |
+------+-------------+
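To see where these numbers come from, the interpolation can be sketched outside BigQuery. This is a minimal Python sketch, assuming the usual definition of a continuous percentile (position p * (n - 1) in the sorted list, interpolated between neighbours); the function name percentile_cont is reused here only for illustration, and this is not BigQuery's implementation.

```python
import math

def percentile_cont(values, p):
    """Linearly interpolated percentile of `values` for 0 <= p <= 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    pos = p * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    # Interpolate between the two neighbouring values.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Same rows as the example query, grouped by room.
rooms = {1: [11, 12, 14, 19, 13], 2: [20, 21, 29], 3: [30]}
medians = {room: percentile_cont(temps, 0.5) for room, temps in rooms.items()}
print(medians)
```

With an even number of rows the result falls between two values: under the same assumption, percentile_cont([20, 21, 29, 30], 0.5) gives 25.0.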
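Sorted, the room 1 temperatures are 11, 12, 13, 14, 19, so "13" is the correct median value. - Fab Dot
When you don't need an exact result and an approximation is good enough, you can combine the NTH and QUANTILES aggregate functions instead. This approach scales better than the analytic window function, but it only gives approximate results.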
SELECT room,
       NTH(50, QUANTILES(temperature, 101))
FROM (SELECT 1 AS room, 11 AS temperature),
     (SELECT 1 AS room, 12 AS temperature),
     (SELECT 1 AS room, 14 AS temperature),
     (SELECT 1 AS room, 19 AS temperature),
     (SELECT 1 AS room, 13 AS temperature),
     (SELECT 2 AS room, 20 AS temperature),
     (SELECT 2 AS room, 21 AS temperature),
     (SELECT 2 AS room, 29 AS temperature),
     (SELECT 3 AS room, 30 AS temperature)
GROUP BY room
This returns:
room temperature
1 13
2 21
3 30
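Which element of the QUANTILES result is the median depends on how NTH counts. In legacy SQL, QUANTILES(expr, 101) returns 101 values (minimum first, maximum last) and NTH is 1-based. The Python sketch below mimics that indexing with exact boundaries; quantile_boundaries and nth are hypothetical helpers written for this illustration, and BigQuery's own boundaries are approximate.

```python
def quantile_boundaries(values, count):
    """Return `count` evenly spaced boundaries: minimum first, maximum last."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Nearest-rank pick; exact here, whereas BigQuery's result is approximate.
    return [ordered[round(i * last / (count - 1))] for i in range(count)]

def nth(n, items):
    """1-based lookup, mimicking NTH in legacy SQL."""
    return items[n - 1]

values = list(range(101))  # 0..100, so the true median is 50
boundaries = quantile_boundaries(values, 101)
print(nth(50, boundaries), nth(51, boundaries))
```

On this input the 51st element is the median and the 50th is one step below it. On the tiny example table above, the difference may not be visible.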
Use NTH(51, QUANTILES(temperature, 101)) to calculate the median, because NTH is 1-based. See https://cloud.google.com/bigquery/query-reference#quantiles - Richard Poole
2018 update, with more metrics:
BigQuery SQL: Average, geometric mean, remove outliers, median
For my own reference, here are working queries against the taxi data:
Approximate quantiles:
SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
Using the PERCENTILE_DISC window function:
SELECT month, FIRST(median) median
FROM (
SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
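PERCENTILE_DISC differs from PERCENTILE_CONT in that it always returns a value that actually occurs in the data. The Python sketch below assumes the standard SQL definition (the first sorted value whose cumulative distribution is at least the requested percentile); whether legacy SQL matches it in every edge case is an assumption here. The four-element list is made up to show the even-count case.

```python
import math

def percentile_disc(values, p):
    """First sorted value whose cumulative distribution is >= p."""
    ordered = sorted(values)
    # cume_dist of the k-th value (1-based) is k / n, so we need k >= p * n.
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

print(percentile_disc([11, 12, 14, 19, 13], 0.5))  # room 1 from the earlier example
print(percentile_disc([20, 21, 29, 30], 0.5))      # hypothetical even-sized group
```

For the even-sized group this picks 21, the lower of the two middle values, where an interpolating percentile would give 25.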
Standard SQL:
#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
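The index differs from the legacy NTH form: APPROX_QUANTILES(x, 1000) returns an array of 1001 elements and OFFSET is 0-based, so OFFSET(500) is the middle element. A rough Python sketch of that indexing follows. It uses exact boundaries and made-up tip amounts; quantiles_exact is a hypothetical helper, not a BigQuery function.

```python
import statistics

def quantiles_exact(values, buckets):
    """Return buckets + 1 boundaries, minimum first and maximum last."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Nearest-rank pick; BigQuery's APPROX_QUANTILES is approximate instead.
    return [ordered[round(i * last / buckets)] for i in range(buckets + 1)]

tips = [i / 100 for i in range(1, 2002)]  # hypothetical tips: 0.01 .. 20.01
q = quantiles_exact(tips, 1000)
print(len(q), q[500], statistics.median(tips))
```

On this made-up data, q[500] agrees with statistics.median. On real data, APPROX_QUANTILES trades some accuracy for scalability, so small deviations should be expected.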