SQL移动平均

16

如何在SQL中创建移动平均线?

当前数据表:

Date             Clicks 
2012-05-01       2,230
2012-05-02       3,150
2012-05-03       5,520
2012-05-04       1,330
2012-05-05       2,260
2012-05-06       3,540
2012-05-07       2,330

期望的表格或输出结果:

Date             Clicks    3 day Moving Average
2012-05-01       2,230
2012-05-02       3,150
2012-05-03       5,520          4,360
2012-05-04       1,330          3,330
2012-05-05       2,260          3,120
2012-05-06       3,540          3,320
2012-05-07       2,330          3,010

你使用的数据库系统是什么? - Brian Webster
@BrianWebster:他在评论我的(现已删除的)帖子中说:他正在使用Hive。但是你已经删除了它的标签。 - user330315
好的,已修复 - 我真的没有意识到那是一个数据库系统。 - Brian Webster
13个回答

19

这是一个经典的Joe Celko问题。我不知道使用哪个DBMS平台。但无论如何,Joe在10多年前就能用标准SQL回答。

Joe Celko SQL Puzzles and Answers引用: "最后一次更新尝试表明我们可以使用谓词来构建一个查询,该查询将给出移动平均值:"

SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;

“额外列”和“查询方法”,哪个更好?从技术上讲,“查询方法”更好,因为“更新方法”会使数据库失去规范性。但是,如果记录的历史数据不会发生变化,并且计算移动平均成本很高,则可以考虑使用“额外列”方法。
MS SQL示例:
CREATE TABLE #TestDW
( Date1 datetime,
  LoadValue Numeric(13,6)
);

INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );

SQL难题查询:

SELECT S1.date1,  AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
    BETWEEN DATEADD(d, -2, S1.date1 )
    AND S1.date1
GROUP BY S1.date1
order by 1;

1
谢谢提供信息,但我很难翻译出它是如何解决问题的。您能否提供一下您在问题中使用的查询语句? - Don P
1
这甚至更好,因为它可以修改以查找N个月的移动平均值。 - Faiz

8

一种方法是在同一张表上多次进行连接。

select
 (Current.Clicks 
  + isnull(P1.Clicks, 0)
  + isnull(P2.Clicks, 0)
  + isnull(P3.Clicks, 0)) / 4 as MovingAvg3
from
 MyTable as Current
 left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
 left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
 left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)

调整ON-Clauses中的DateAdd组件,以匹配您是想要严格从过去到现在还是从几天前到几天后进行移动平均。

  • 这对于需要对少量数据点进行移动平均的情况非常有效。
  • 对于包含多个数据点的移动平均值,这不是最佳解决方案。

左连接它们(看看前两个没有)。 - Chris Pfohl
1
对于大型表,进行4个连接操作不会成为一个相当昂贵的操作吗? - Don P
根据数据而定,但根据我的经验,这是一个相当快速的操作。 - Brian Webster

3
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date

示例在这里

显然,您可以将时间间隔更改为所需的任何值。您还可以使用count()代替魔数以使更改更容易,但这也会使它变慢。


你的前两个条目是1天和2天的平均值。问题要求这些条目为“NULL”。 - Michael Currie

2

滚动平均值的通用模板,适用于大数据集

WITH moving_avg AS (
  SELECT 0 AS [lag] UNION ALL
  SELECT 1 AS [lag] UNION ALL
  SELECT 2 AS [lag] UNION ALL
  SELECT 3 AS [lag] --ETC
)
SELECT
  DATEADD(day,[lag],[date]) AS [reference_date],
  [otherkey1],[otherkey2],[otherkey3],
  AVG([value1]) AS [avg_value1],
  AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];

今日免费次数已满, 请开通会员/明日再来
WITH weighted_avg AS (
  SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
  SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
  SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
  SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
  DATEADD(day,[lag],[date]) AS [reference_date],
  [otherkey1],[otherkey2],[otherkey3],
  AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
  AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];

权重的有趣方法。不过对于更离散的时间点(时间戳而非日期)可能效果不佳。 - msciwoj
1
除了学术练习之外,固定权重滚动平均值在非均匀间隔上有什么用途呢?难道你不是先对数据进行直方图处理或根据间隔大小计算权重吗? - Anon
一定要统一。根据当前时间点的距离,将其投入适当的权重桶中。例如,“对于距离当前数据点24小时以内的数据点,取权重=1;对于距离当前数据点48小时以内的数据点,取权重=0.5...”。在这种情况下,连续数据点(如上午6:12和晚上11:48)之间的距离有多少是很重要的...我能想到的一个用例是尝试在数据点不足的地方平滑直方图。 - msciwoj

2
select *
        , (select avg(c2.clicks) from #clicks_table c2 
            where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1

1
假设 x 是要取平均值的数值,xDate 是日期数值:
从我的表中选择 avg(x),其中 xDate 在 dateadd(d, -2, xDate) 和 xDate 之间。

1

In hive, maybe you could try

select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;

1

使用不同的连接谓词:

SELECT current.date
       ,avg(periods.clicks)
FROM current left outer join current as periods
       ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
GROUP BY current.date HAVING COUNT(*) >= 3

使用having语句可以防止返回任何没有至少N个值的日期。


这将不显示5月1日和5月2日的行,而提问者希望看到它们的NULL值。 - Michael Currie

0

注意:这不是一个答案,而是Diego Scaravaggi答案的增强代码示例。我将其发布为答案,因为评论部分不足够。请注意,我已经对移动平均线的周期进行了参数化。

declare @p int = 3
declare @t table(d int, bal float)
insert into @t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)

select a.d, avg(b.bal)
from
       @t a
       left join @t b on b.d between a.d-(@p-1) and a.d
group by a.d

0

我不确定你期望的结果(输出)是否展示了经典的“三日简单移动平均线”。因为,例如,根据定义,前三个数字组成的三元组应该是:

ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333

但是你期望的是4.360,这很令人困惑。

尽管如此,我建议采用以下解决方案,它使用窗口函数AVG。这种方法比其他答案中介绍的SELF-JOIN更高效(清晰且资源消耗较少),我很惊讶没有人提供更好的解决方案。

-- Oracle-SQL dialect 
with
  data_table as (
     select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
     select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
     select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
     select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
     select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
     select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
     select date '2012-05-07' AS dt, 2.330 AS clicks from dual  
  ),
  param as (select 3 days from dual)
select
   dt     AS "Date",
   clicks AS "Clicks",

   case when rownum >= p.days then 
       avg(clicks) over (order by dt
                          rows between p.days - 1 preceding and current row)
   end    
          AS "3 day Moving Average"
from data_table t, param p;

你可以看到AVG被包裹在case when rownum >= p.days then中,以强制在前几行中出现NULL,因为“3天移动平均值”在这些行中是没有意义的。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接