我有一个包含约十亿行的数据库,我们需要确定该组中位数年龄。对十亿行进行排序很困难,但是如果您汇总可以找到的不同值(年龄从0到100),您可以对此列表进行排序,并使用一些算术魔法来查找任何想要的百分位数,如下所示:
with rawData(count_value) as
(
select p.YEAR_OF_BIRTH
from dbo.PERSON p
),
overallStats (avg_value, stdev_value, min_value, max_value, total) as
(
select avg(1.0 * count_value) as avg_value,
stdev(count_value) as stdev_value,
min(count_value) as min_value,
max(count_value) as max_value,
count(*) as total
from rawData
),
aggData (count_value, total, accumulated) as
(
select count_value,
count(*) as total,
SUM(count(*)) OVER (ORDER BY count_value ROWS UNBOUNDED PRECEDING) as accumulated
FROM rawData
group by count_value
)
select o.total as count_value,
o.min_value,
o.max_value,
o.avg_value,
o.stdev_value,
MIN(case when d.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
MIN(case when d.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
MIN(case when d.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
MIN(case when d.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
MIN(case when d.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
from aggData d
cross apply overallStats o
GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
;
这个查询取决于你的数据库支持窗口函数(包括 ROWS UNBOUNDED PRECEDING),但如果不支持,可以通过将 aggData CTE 自连接并将所有先前的总计聚合到“累积”列中来解决,该列用于确定哪个值包含指定的百分位数。上面的示例计算了 p10、p25、p50(中位数)、p75 和 p90。
-Chris