按月分组获取数据集的百分位数

4

我有一个SQL表,其中有很多记录看起来像这样:

| Date       | Score |
+ -----------+-------+
| 01/01/2010 |     4 |
| 02/01/2010 |     6 |
| 03/01/2010 |    10 |
  ...
| 16/03/2010 |     2 |

我将其绘制在图表上,这样我就可以得到一个漂亮的线,表示随时间变化的得分。太好了。

现在,我需要在图表中包括平均分数,以便我们可以看到它随时间的变化,因此我只需将其添加到混合物中:

SELECT 
    YEAR(SCOREDATE) 'Year', MONTH(SCOREDATE) 'Month',
    MIN(SCORE) MinScore, 
    AVG(SCORE) AverageScore, 
    MAX(SCORE) MaxScore
FROM SCORES
GROUP BY YEAR(SCOREDATE), MONTH(SCOREDATE) 
ORDER BY YEAR(SCOREDATE), MONTH(SCOREDATE) 

到目前为止还没有问题。

问题在于,我怎样能够轻松地计算每个时间段的百分位数呢?我不确定这是否是正确的短语。我需要的总体上是:

  • 图表上显示得分的线(容易)
  • 图表上显示平均值的线(容易)
  • 图表上显示95%得分所占区间的线(困难)

第三个我不明白。我需要计算5%分位数的数字,这可以单独完成:

SELECT MAX(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE 
    FROM SCORES
    WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
    ORDER BY SCORE ASC) AS SubQ

SELECT MIN(SubQ.SCORE) FROM
    (SELECT TOP 45 PERCENT SCORE 
    FROM SCORES
    WHERE YEAR(SCOREDATE) = 2010 AND MONTH(SCOREDATE) = 1
    ORDER BY SCORE DESC) AS SubQ

但我不知道如何获取所有月份的表格。

| Date       | Average | 45% | 55% |
+ -----------+---------+-----+-----+
| 01/01/2010 |      13 |  11 |  15 |
| 02/01/2010 |      10 |   8 |  12 |
| 03/01/2010 |       5 |   4 |  10 |
  ...
| 16/03/2010 |       7 |   7 |   9 |

目前我需要将这些内容加载到我的应用程序中,并自己计算数据。或者运行更多的单独查询并整合结果。


哪个版本的SQL Server?如果可用的话,您可能想查看NTILE函数的文档。 - AakashM
嘿,我本来是在看到这里的另一个问题后想要添加这个功能的,但是后来忘记了!我使用的是2000版本,所以无法使用NTILE :( - Cylindric
2个回答

3

哇,这真是一个大脑风暴。首先,我的测试表模式如下:

Create Table Scores 
( 
    Id int not null identity(1,1) primary key clustered
    , [Date] datetime not null
    , Score int not null
)

首先,我使用SQL 2008中的CTE计算值以检查我的答案,然后构建了一个在SQL 2000中应该有效的解决方案。因此,在SQL 2008中我们要这样做:

;With 
    SummaryStatistics As
    (
        Select Year([Date]) As YearNum
            , Month([Date]) As MonthNum
            , Min(Score) As MinScore
            , Max(Score) As MaxScore
            , Avg(Score) As AvgScore
        From Scores
        Group By Month([Date]), Year([Date])
    )
    , Percentiles As
    (
        Select Year([Date]) As YearNum
            , Month([Date]) As MonthNum
            , Score
            , NTile( 100 ) Over ( Partition By Month([Date]), Year([Date]) Order By Score ) As Percentile
        From Scores
    )
    , ReportedPercentiles As
    (
        Select YearNum, MonthNum
            , Min(Case When Percentile = 45 Then Score End) As Percentile45
            , Min(Case When Percentile = 55 Then Score End) As Percentile55
        From Percentiles
        Where Percentile In(45,55)
        Group By YearNum, MonthNum
    )
Select SS.YearNum, SS.MonthNum
    , SS.MinScore, SS.MaxScore, SS.AvgScore
    , RP.Percentile45, RP.Percentile55
From SummaryStatistics As SS
    Join ReportedPercentiles As RP
        On  RP.YearNum = SS.YearNum
            And RP.MonthNum = SS.MonthNum
Order By SS.YearNum, SS.MonthNum

现在介绍一个SQL 2000的解决方案。本质上,诀窍是使用两个临时表来计算分数的出现次数。

If object_id('tempdb..#Working') is not null
    DROP TABLE #Working
GO
Create Table #Working 
    (
    YearNum int not null
    , MonthNum int not null
    , Score int not null
    , Occurances int not null
    , Constraint PK_#Working Primary Key Clustered ( MonthNum, YearNum, Score )
    )
GO
Insert #Working(MonthNum, YearNum, Score, Occurances)
Select Month([Date]), Year([Date]), Score, Count(*)
From Scores
Group By Month([Date]), Year([Date]), Score
GO
If object_id('tempdb..#SummaryStatistics') is not null
    DROP TABLE #SummaryStatistics
GO
Create Table #SummaryStatistics
    (
    MonthNum int not null
    , YearNum int not null
    , Score int not null
    , Occurances int not null
    , CumulativeTotal int not null
    , Percentile float null
    , Constraint PK_#SummaryStatistics Primary Key Clustered ( MonthNum, YearNum, Score )
    )
GO
Insert #SummaryStatistics(YearNum, MonthNum, Score, Occurances, CumulativeTotal)
Select W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances, Sum(W1.Occurances)-W2.Occurances
From #Working As W1
    Join #Working As W2 
        On W2.YearNum = W1.YearNum
            And W2.MonthNum = W1.MonthNum
Where W1.Score <= W2.Score
Group By W2.YearNum, W2.MonthNum, W2.Score, W2.Occurances

Update #SummaryStatistics
Set Percentile = SS.CumulativeTotal * 100.0 / MonthTotal.Total
From #SummaryStatistics As SS
    Join    (
            Select SS1.YearNum, SS1.MonthNum, Max(SS1.CumulativeTotal) As Total
            From #SummaryStatistics As SS1
            Group By SS1.YearNum, SS1.MonthNum
            ) As MonthTotal
        On MonthTotal.YearNum = SS.YearNum
            And MonthTotal.MonthNum = SS.MonthNum

Select GeneralStats.*, Percentiles.Percentile45, Percentiles.Percentile55
From    (
        Select  Year(S1.[Date]) As YearNum
            , Month(S1.[Date]) As MonthNum
            , Min(S1.Score) As MinScore
            , Max(S1.Score) As MaxScore
            , Avg(S1.Score) As AvgScore
        From Scores As S1
        Group By Month(S1.[Date]), Year(S1.[Date])
        ) As GeneralStats
    Join    (
            Select SS1.YearNum, SS1.MonthNum
                , Min(Case When SS1.Percentile >= 45 Then Score End) As Percentile45
                , Min(Case When SS1.Percentile >= 55 Then Score End) As Percentile55
            From #SummaryStatistics As SS1
            Group By SS1.YearNum, SS1.MonthNum 
            ) As Percentiles
        On Percentiles.YearNum = GeneralStats.YearNum
            And Percentiles.MonthNum = GeneralStats.MonthNum

哇,这是一组相当复杂的操作 :) 不过似乎能正常工作,所以太棒了,谢谢! - Cylindric

1

没有数据,我不确定我是否做得对,但也许这可以帮助您每年只使用两个查询而不是24个...

SELECT MAX(SubQ.SCORE), MyMonth  FROM
    (SELECT TOP 45 PERCENT SCORE , MONTH(SCOREDATE) as MyMonth 
    FROM SCORES
    WHERE YEAR(SCOREDATE) = 2010
    ORDER BY SCORE ASC) AS SubQ
group by  MyMonth   

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接