Efficient time series querying in Postgres

11

I have a table in my PG database that looks roughly like this:

id | widget_id | for_date | score |
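Each referenced widget has many of these rows. There is always at most one row per widget per day, but there are gaps between days.
What I want is a result that contains every widget for every date since X. The dates are brought in via generate_series: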
 SELECT date.date::date
   FROM generate_series('2012-01-01'::timestamp with time zone,'now'::text::date::timestamp with time zone, '1 day') date(date)
 ORDER BY date.date DESC;
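If a given widget_id has no entry for some date, I want to use the previous entry. For example, if widget 1337 has no entry on 2012-05-10 but has one on 2012-05-08, the result set should show the 2012-05-08 entry for 2012-05-10 as well: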
Actual data:
widget_id | for_date   | score
1312      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1337      | 2012-05-08 | 41
1337      | 2012-05-11 | 500

Desired output based on generate series:
widget_id | for_date   | score
1336      | 2012-05-07 | 20
1337      | 2012-05-07 | 12
1336      | 2012-05-08 | 20
1337      | 2012-05-08 | 41
1336      | 2012-05-09 | 20
1337      | 2012-05-09 | 41
1336      | 2012-05-10 | 20
1337      | 2012-05-10 | 41
1336      | 2012-05-11 | 20
1337      | 2012-05-11 | 500

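Eventually I want to boil this down into a view, so that I have a consistent data set for every day that I can query easily.

Edit: Made the sample data and the expected result set clearer.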


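Please clarify: do you want to list rows for all widget_ids at once (as your text at the top indicates), or only rows for a given widget_id (as your example shows)? - Erwin Brandstetter
Why is widget_id=1312 missing from the desired output? And where does widget_id=1336 come from? - wildplasser
Follow-up with a solution for the missing values: https://dev59.com/c3jZa4cB1Zd3GeqPaBwB#19452830 - Erwin Brandstetter
4 Answers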

8

SQL Fiddle

select
    widget_id,
    for_date,
    case
        when score is not null then score
        else first_value(score) over (partition by widget_id, c order by for_date)
        end score
from (
    select
        a.widget_id,
        a.for_date,
        s.score,
        count(score) over(partition by a.widget_id order by a.for_date) c
    from (
        select widget_id, g.d::date for_date
        from (
            select distinct widget_id
            from score
            ) s
            cross join
            generate_series(
                (select min(for_date) from score),
                (select max(for_date) from score),
                '1 day'
            ) g(d)
        ) a
        left join
        score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date
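
The query above comes without commentary, so here is a reduced sketch of the idea it appears to rely on. It uses inline sample values for a single widget instead of the score table (the VALUES rows and the column name filled are assumptions made for illustration). count(score) ignores NULLs, so the running count c only increases on days that have a real score. Every gap day therefore shares a group number with the last real score before it, and first_value() within that group picks that score up.

```sql
-- Minimal gap-filling sketch on inline data (one widget only).
SELECT for_date,
       score,
       c,
       -- first row of each group c is the day that carried a real score
       first_value(score) OVER (PARTITION BY c ORDER BY for_date) AS filled
FROM (
    SELECT for_date,
           score,
           -- running count of non-NULL scores: a new group starts at every real score
           count(score) OVER (ORDER BY for_date) AS c
    FROM (VALUES
        (date '2012-05-08', 41),
        (date '2012-05-09', NULL),
        (date '2012-05-10', NULL),
        (date '2012-05-11', 500)
    ) AS t(for_date, score)
) s
ORDER BY for_date;
```

In the full query the same running count is additionally partitioned by widget_id, so each widget gets its own groups.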

7

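First of all, you can use a much simpler generate_series() expression. This one is equivalent to yours (except for the descending sort order, which contradicts the rest of your question anyway):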

SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date

The type date is coerced to timestamptz automatically on input. The return type is timestamptz either way. I use a subquery below so that I can cast the output to date right away.
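
A quick way to check the type behaviour described above is pg_typeof(). This is only an illustrative sketch; the three-day range and the column aliases are arbitrary choices, not part of the original answer.

```sql
-- Inspect the type generate_series() returns for date input,
-- and the type after casting each value back to date.
SELECT pg_typeof(d)       AS series_type,
       pg_typeof(d::date) AS cast_type
FROM   generate_series('2012-01-01'::date, '2012-01-03'::date, '1 day') AS g(d)
LIMIT  1;
```

In line with the statement above, series_type should come back as timestamp with time zone and cast_type as date.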

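Next, the window function max() returns exactly what you need: the highest value since the start of the frame, ignoring NULL values. Building on that, you get a radically simple query.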
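
To see that behaviour in isolation, here is a small sketch on inline values (the VALUES rows stand in for the result of joining the calendar to one widget's rows and are assumed for illustration). Days without an entry have a NULL for_date, and the running max() carries the latest real date forward.

```sql
-- Running max() over the calendar: NULLs are ignored, so gap days
-- inherit the most recent date that actually has an entry.
SELECT day,
       for_date,
       max(for_date) OVER (ORDER BY day) AS effective_date
FROM (VALUES
    (date '2012-05-07', date '2012-05-07'),
    (date '2012-05-08', date '2012-05-08'),
    (date '2012-05-09', NULL),
    (date '2012-05-10', NULL),
    (date '2012-05-11', date '2012-05-11')
) AS t(day, for_date)
ORDER BY day;
```

For 2012-05-09 and 2012-05-10 the effective_date is 2012-05-08, which is the row to join back to for the score.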

For a given widget_id

Most likely faster than a solution involving CROSS JOIN or WITH RECURSIVE:

SELECT a.day, s.*
FROM  (
   SELECT d.day
         ,max(s.for_date) OVER (ORDER BY d.day) AS effective_date
   FROM  (
      SELECT generate_series('2012-01-01'::date, now()::date, '1d')::date
      ) d(day)
   LEFT   JOIN score s ON s.for_date = d.day
                      AND s.widget_id = 1337 -- "for a given widget_id"
   ) a
LEFT   JOIN score s ON s.for_date = a.effective_date
                   AND s.widget_id = 1337
ORDER  BY a.day;

->sqlfiddle

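With this query you can put any column from score into the final SELECT list. I used s.* for simplicity; pick the columns you need.

If you want the output to start with the first day that actually has a score, replace the last LEFT JOIN with JOIN.

Generic form for all widget_ids

Here I use a CROSS JOIN to produce a row for every widget on every date: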

SELECT a.day, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-05-05'::date
                                ,'2012-05-15'::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
JOIN  score s ON s.for_date = a.effective_date
             AND s.widget_id = a.widget_id  -- instead of LEFT JOIN
ORDER BY a.day, a.widget_id;

->sqlfiddle
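
The question ultimately asks for a view. One possible way to wrap the generic query is sketched below. The view name widget_daily_score and the fixed start date are assumptions chosen for illustration; the table name score follows this answer.

```sql
-- Hypothetical view wrapping the generic gap-filling query.
-- The start date '2012-01-01' is taken from the question; adjust as needed.
CREATE VIEW widget_daily_score AS
SELECT a.day AS for_date, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id
                                ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-01-01'::date
                                ,now()::date, '1 day')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
JOIN  score s ON s.for_date = a.effective_date
             AND s.widget_id = a.widget_id;

-- Example usage: one row per widget per day from a chosen date onwards.
SELECT *
FROM   widget_daily_score
WHERE  for_date >= '2012-05-07'
ORDER  BY for_date, widget_id;
```

Because the date filter is applied to the view's output, the carried-forward values are still computed from the full series starting at the view's start date.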


1
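Does your cross join query work for multiple widget_ids? SQL Fiddle. From the question: "What I want is a result that contains every widget for every date since X." - Clodoaldo Neto
And @Clodoaldo is right, of course: a CROSS JOIN is the way to produce the grid for the generic solution. - Erwin Brandstetter
Great solution. I have used a variant of it in a project, but I ran into a problem when the "previous" value falls before the time period you are looking at. See this SQL Fiddle for the simplest example. Is there a way to adjust for this? Thanks! - bpaul
@bpaul: As shown in my first example, you can use a LEFT JOIN ([SQLfiddle](http://www.sqlfiddle.com/#!12/d7a20/3)). If you want something else, I suggest you start a *new question* that provides all the necessary details. You can always refer to this one for context. - Erwin Brandstetter
@ErwinBrandstetter: Thanks for the reply. I posted a new question at https://dev59.com/c3jZa4cB1Zd3GeqPaBwB, could you take a look? - bpaul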

2

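Using your table structure, I created the following recursive CTE, which starts at each widget's MIN(For_Date) and increments day by day until it reaches the MAX(For_Date). I am not sure whether there is a more efficient way, but this appears to work well: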

WITH RECURSIVE nodes_cte(widgetid, for_date, score) AS (
-- First Widget Using Min Date
 SELECT 
    w.widgetId, 
    w.for_date, 
    w.score
 FROM widgets w 
  INNER JOIN ( 
      SELECT widgetId, Min(for_date) min_for_date
      FROM widgets
      GROUP BY widgetId
   ) minW ON w.widgetId = minW.widgetid 
        AND w.for_date = minW.min_for_date
UNION ALL
 SELECT 
    n.widgetId,
    n.for_date + 1 for_date,
    coalesce(w.score,n.score) score
 FROM nodes_cte n
  INNER JOIN (
      SELECT widgetId, Max(for_date) max_for_date
      FROM widgets 
      GROUP BY widgetId
   ) maxW ON n.widgetId = maxW.widgetId
  LEFT JOIN widgets w ON n.widgetid = w.widgetid 
    AND n.for_date + 1 = w.for_date
  WHERE n.for_date + 1 <= maxW.max_for_date
)
SELECT * 
FROM nodes_cte 
ORDER BY for_date

Here is the SQL Fiddle.
And here are the returned results (format the date however you like):
WIDGETID   FOR_DATE                     SCORE
1337       May, 07 2012 00:00:00+0000   12
1337       May, 08 2012 00:00:00+0000   41
1337       May, 09 2012 00:00:00+0000   41
1337       May, 10 2012 00:00:00+0000   41
1337       May, 11 2012 00:00:00+0000   500

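Please note that this assumes your For_Date field is a date. If it includes a time, you may need to use interval '1 day' in the query above instead. Hope this helps.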
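
As a small illustration of that note (the literal values are arbitrary): adding an integer works for the date type and advances it by whole days, while a timestamp needs an interval.

```sql
-- date + integer adds days; a timestamp is advanced with an interval instead.
SELECT date '2012-05-07' + 1                               AS next_date,
       timestamp '2012-05-07 00:00' + interval '1 day'     AS next_ts;
```

Postgres has no timestamp + integer operator, so n.for_date + 1 in the recursive CTE would raise an error if for_date were a timestamp column.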

0

Data:

DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

CREATE TABLE widget
        ( widget_id INTEGER NOT NULL
        , for_date DATE NOT NULL
        , score INTEGER
         , PRIMARY KEY (widget_id,for_date)
        );
INSERT INTO widget(widget_id , for_date , score) VALUES
 (1312, '2012-05-07', 20)
, (1337, '2012-05-07', 12)
, (1337, '2012-05-08', 41)
, (1337, '2012-05-11', 500)
        ;

Query:

SELECT w.widget_id AS widget_id
        , cal::date AS for_date
        -- , w.for_date AS org_date
        , w.score AS score
FROM generate_series( '2012-05-07'::timestamp , '2012-05-11'::timestamp
                 , '1day'::interval) AS cal
        -- "half cartesian" Join;
        -- will be restricted by the NOT EXISTS() below
LEFT JOIN widget w ON w.for_date <= cal
WHERE NOT EXISTS (
        SELECT * FROM widget nx
        WHERE nx.widget_id = w.widget_id
        AND nx.for_date <= cal
        AND nx.for_date > w.for_date
        )
ORDER BY cal, w.widget_id
        ;

Result:

 widget_id |  for_date  | score 
-----------+------------+-------
      1312 | 2012-05-07 |    20
      1337 | 2012-05-07 |    12
      1312 | 2012-05-08 |    20
      1337 | 2012-05-08 |    41
      1312 | 2012-05-09 |    20
      1337 | 2012-05-09 |    41
      1312 | 2012-05-10 |    20
      1337 | 2012-05-10 |    41
      1312 | 2012-05-11 |    20
      1337 | 2012-05-11 |   500
(10 rows)
