PostgreSQL window function: partition by comparison

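I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query. Imagine I have the short list of 5 elements in the following query (in the real case I have thousands or even millions of rows). For each row, I am trying to get the id of the next different element (by the event column) and the id of the previous different element.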
WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id)  over w as previous_different, event
     , lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);

I know the comparison event!=ev.event is not valid here, but it expresses what I want to achieve.
The result I get (the same as if I remove the PARTITION BY clause) is:
 |12|2
1|12|3
2|13|4
3|13|5
4|12|

And the result I would like to get is:

 |12|3
 |12|3
2|13|5
2|13|5
4|12|

Is this possible, and if so, how? Thank you very much!

EDIT: I know it can be done with two JOINs, an ORDER BY and a DISTINCT ON, but with the real data of millions of rows that approach is very inefficient.

WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event) 
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event) 
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC

Unrelated: when generating test data you can use the shorter VALUES clause instead of writing SELECT ... UNION: http://sqlfiddle.com/#!15/d41d8/1471 - user330315
Is the timestamp (which should not be named "date") unique? - Erwin Brandstetter
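This is a tricky one. You want a dynamic window that extends until some other value differs from the current one. I think that would only be possible with a RANGE window clause, which PostgreSQL doesn't support yet. I'm afraid you might be stuck with the joins, since the preceding and following ranges may be unbounded. - Craig Ringer
@CraigRinger: I think I managed to get it unstuck. :) - Erwin Brandstetter
@ErwinBrandstetter, no, timestamps can repeat (two different events can happen in the same second, and they then arrive in order in the id field). Sorry that it is called date (it was translated from the original Spanish name). - Aleix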
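@A It should be easy to adapt the query to your new specification: add id to ORDER BY, and so on. Next time, remember to provide all the necessary information up front so we don't waste time. - Erwin Brandstetter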
1 Answer


Using several different window functions and two subqueries, this should work decently fast:

WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;

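This uses ts as the name of the timestamp column. It assumes ts is unique and indexed (a unique constraint creates the index automatically).
In a test with a real-life table of 50k rows it needed only a single index scan, so it should be decently fast even with big tables. In comparison, your query with joins and DISTINCT ON had not finished after a minute (as expected).
Even an optimized version that deals with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join) had not finished after a minute.
For best performance with a big table, tune your memory settings, in particular work_mem (used for big sort operations). If you can spare the RAM, consider setting it higher temporarily for your session.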
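
A minimal sketch of such a temporary setting. The value '256MB' is only an assumed placeholder, not a recommendation; pick whatever your machine can afford:

```sql
-- Raise work_mem for the current session only ('256MB' is an assumed value).
SET work_mem = '256MB';

-- ... run the window-function query here ...

-- Go back to the server default afterwards.
RESET work_mem;
```

Inside a transaction, SET LOCAL work_mem = '...' limits the change to that transaction instead of the whole session.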

How?

  1. In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).

  2. In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.

  3. In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
    Actually, this should be even faster in the outer SELECT:

     last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                               RANGE BETWEEN UNBOUNDED PRECEDING
                                     AND     UNBOUNDED FOLLOWING) AS post_id
    

    ... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
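
To make the three steps easier to follow, the sketch below runs only the two inner levels of the query on the sample data and exposes the helper columns step and grp. The event column is added to the output purely for illustration; it is not selected in the query above.

```sql
WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT id, event, step
     , count(step) OVER w AS grp   -- running count of non-null steps
     , pre_id, post_id
FROM  (
   SELECT id, event, ts
        , NULLIF(lag(event) OVER w, event) AS step  -- non-null only where event changes
        , lag(id)  OVER w AS pre_id
        , lead(id) OVER w AS post_id
   FROM   events
   WINDOW w AS (ORDER BY ts)
   ) sub1
WINDOW w AS (ORDER BY ts)
ORDER  BY ts;
-- id | event | step | grp | pre_id | post_id
--  1 |    12 | NULL |   0 |   NULL |       2
--  2 |    12 | NULL |   0 |      1 |       3
--  3 |    13 |   12 |   1 |      2 |       4
--  4 |    13 | NULL |   1 |      3 |       5
--  5 |    12 |   13 |   2 |      4 |    NULL
```

Taking the first pre_id and the last post_id within each grp then gives (NULL, 3), (2, 5) and (4, NULL), which matches the result asked for in the question.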

SQL Fiddle.
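
If ts is not unique, as mentioned in the comments, one possible adaptation is to add id as a tie-breaker to every ORDER BY. The following is only a sketch under the assumption that id is unique and reflects arrival order. The duplicated timestamp in the sample data and the named window w2 are introduced here for illustration, and ROWS is used instead of RANGE for the frame (with both bounds unbounded the two cover the same rows).

```sql
WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:00:00')   -- same timestamp as id 1 (assumed test case)
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id) OVER w2 AS pre_id
     , id, event
     , last_value(post_id) OVER w2 AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, event, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts, id)   -- id breaks ties between equal timestamps
      ) sub1
   WINDOW w AS (ORDER BY ts, id)
   ) sub2
WINDOW w2 AS (PARTITION BY grp ORDER BY ts, id
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY ts, id;
```

Without the tie-breaker, rows sharing a timestamp would be peers in the default window frame, so lag(), lead() and the running count could produce an arbitrary or misleading order for them.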

