Most efficient way to join two time series

Suppose I have a table like this:
CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date)
);

With values like these:

INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;

I want to be able to do this:
SELECT a.snapshot_date, 
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS COUNT
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date
 GROUP BY a.snapshot_date

Producing results that look like this:
*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
|  2017-01-02   |   10.0    |    1  |
|  2017-01-03   |   7.0     |    2  |
|  2017-01-04   |   9.0     |    3  |
|  2017-01-05   |   8.5     |    4  |
|  2017-01-06   |   9.8     |    5  |
*---------------*-----------*-------*

In this example the query runs very quickly, because there are only a few rows. The problem is that I have to run this over millions of rows, and on Redshift (whose syntax is similar to Postgres) my query takes days to finish. It is horrendously slow, and yet it is one of my most common query patterns. I suspect the problem is the O(n^2) growth of the self-join, rather than the O(n) I would prefer.
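To see the quadratic blowup concretely: with n distinct dates, the condition a.snapshot_date > b.snapshot_date produces n*(n-1)/2 intermediate rows before the GROUP BY, so 5 million rows would mean roughly 1.25e13 joined pairs. A quick sanity check (a sketch added here for illustration, not part of the original post) is simply to count them:

-- Count the intermediate rows materialized by the inequality self-join.
-- For n distinct dates this returns n*(n-1)/2.
SELECT COUNT(*) AS joined_rows
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date;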

In Python, I use an O(n) implementation that looks like this:

rows = [('2017-01-01', 10),
        ('2017-01-02', 4),
        ('2017-01-03', 13),
        ('2017-01-04', 7),
        ('2017-01-05', 15),
        ('2017-01-06', 8)]

sales_total_previous = 0  # running total of sales from all earlier dates
count = 0                 # number of earlier dates seen so far
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        # the first date has no earlier rows, so there is nothing to report
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date, sales_avg, count))
    sales_total_previous += sales

which prints the following (the same results as the SQL query):

('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)

I'm considering switching to Apache Spark so I can run this in Python, but a few million rows really isn't that big (3-4 GB at most), and spinning up a Spark cluster with 100 GB of RAM for it seems like overkill. Is there an efficient, easy-to-understand way to get O(n) efficiency in SQL, ideally in Postgres/Redshift?

1 Answer

You seem to want:

SELECT ts.snapshot_date, 
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT
FROM time_series ts;

Using window functions makes this much more efficient.


Wow, this is amazing. It cut my runtime from a week down to just 23 seconds. – user554481
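Note (added as a sketch, not part of the original answer): the accepted query's running average includes the current row, whereas the expected output in the question averages only the strictly earlier rows. If that distinction matters, a window frame clause should work on both Postgres and Redshift:

-- Average over strictly earlier rows only; the column alias "count" is illustrative.
SELECT ts.snapshot_date,
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) - 1 AS count
  FROM time_series ts;

The first date has no preceding rows, so its sales_avg comes back NULL (and count is 0); wrap this in a subquery and filter on count > 0 if that row should be dropped, as in the expected table above.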
