PostgreSQL window function: partition by comparison

Question

sql postgresql window-functions postgresql-performance

5

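I'm trying to find a way to compare against the current row in the PARTITION BY clause of a window function in a PostgreSQL query. Suppose I have a short list of 5 elements, as in the following query (in the real case I have thousands of rows or more). For each row, I am trying to get the id of the next different element (by the event column) and the id of the previous different element.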

WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT lag(id)  over w as previous_different, event
     , lead(id) over w as next_different
FROM events ev
WINDOW w AS (PARTITION BY event!=ev.event ORDER BY date ASC);

I know the comparison event!=ev.event is incorrect, but it shows what I am trying to achieve.

The result I get is (the same as if I remove the PARTITION BY clause):

 |12|2
1|12|3
2|13|4
3|13|5
4|12|

And the result I would like to get is:

 |12|3
 |12|3
2|13|5
2|13|5
4|12|

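Is this possible, and if so, how? Thank you very much!

Edit: I know it can be done with two JOINs, an ORDER BY and a DISTINCT ON, but with the millions of rows in the real data this approach is very inefficient.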

WITH events AS(
  SELECT 1 as id, 12 as event, '2014-03-19 08:00:00'::timestamp as date
  UNION SELECT 2 as id, 12 as event, '2014-03-19 08:30:00'::timestamp as date
  UNION SELECT 3 as id, 13 as event, '2014-03-19 09:00:00'::timestamp as date
  UNION SELECT 4 as id, 13 as event, '2014-03-19 09:30:00'::timestamp as date
  UNION SELECT 5 as id, 12 as event, '2014-03-19 10:00:00'::timestamp as date
)
SELECT DISTINCT ON (e.id, e.date) e1.id, e.event, e2.id
FROM events e
LEFT JOIN events e1 ON (e1.date<=e.date AND e1.id!=e.id AND e1.event!=e.event) 
LEFT JOIN events e2 ON (e2.date>=e.date AND e2.id!=e.id AND e2.event!=e.event) 
ORDER BY e.date ASC, e.id ASC, e1.date DESC, e1.id DESC, e2.date ASC, e2.id ASC

- Aleix

Unrelated: when generating test data, you can use the shorter VALUES clause instead of writing SELECT ... UNION: http://sqlfiddle.com/#!15/d41d8/1471 - user330315

Is the timestamp (which should not be named "date") unique? - Erwin Brandstetter

1

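This is a tricky one. You want a dynamic window that only extends until some other value differs from the current one. I think that could only be done with RANGE-based window clauses, which PostgreSQL does not support yet. I'm afraid you might be stuck with joins, because the preceding and following ranges can be unbounded. - Craig Ringer

@CraigRinger: I think I managed to get it unstuck. :) - Erwin Brandstetter

@ErwinBrandstetter, no, the timestamp can be repeated (two different events happening in the same second, which then arrive in order of the id field). Sorry that it is called date (it was translated from the original Spanish name). - Aleix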

1

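@A Adapting the query to your new specification should be easy: add id to the ORDER BY clauses, etc. Next time, remember to provide all the necessary information up front so we don't waste time. - Erwin Brandstetter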

1 Answer

- Erwin Brandstetter · Accepted Answer

Using several different window functions and two subqueries, this should work decently fast:

WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;

This uses ts as the name of the timestamp column. It assumes that ts is unique and indexed (a unique constraint creates the index automatically).
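If ts is not unique (the asker notes in the comments that two events can share a timestamp), the sort order needs a tiebreaker, and the comments suggest adding id to the ORDER BY clauses. The following is a minimal sketch of that adaptation, assuming id is unique and reflects arrival order. The duplicated timestamp in the sample data is made up for illustration.

```sql
WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 08:30:00')  -- hypothetical: same ts as id 2
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts, id)           AS pre_id
     , id, event, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC, id DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, event, ts
           , NULLIF(lag(event) OVER w, event) AS step  -- non-null only where event changes
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts, id)   -- id breaks ties between equal ts
      ) sub1
   WINDOW w AS (ORDER BY ts, id)
   ) sub2
ORDER  BY ts, id;
```

Note that the descending window reverses both columns (ts DESC, id DESC), so that it is the exact mirror of the ascending one.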

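In a test with a real-life table of 50k rows, it needed only a single index scan. So it should be decently fast even with big tables. In comparison, your query with joins and DISTINCT ON did not finish after a minute (as expected).

Even an optimized version that deals with one cross join at a time (a left join with hardly any limiting condition is effectively a limited cross join) did not finish after a minute.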
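To see which plan PostgreSQL picks on your own data, you can prefix the query with EXPLAIN. This is only a sketch: the table definition below is an assumption made for illustration, not the table used in the test above, and the plan you get depends on table size and statistics.

```sql
-- Hypothetical table; names and types are assumptions for illustration.
CREATE TABLE events (
   id    serial PRIMARY KEY
 , event int NOT NULL
 , ts    timestamp NOT NULL UNIQUE   -- the UNIQUE constraint creates an index on ts
);

-- ANALYZE actually executes the query and reports real timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT first_value(pre_id)  OVER (PARTITION BY grp ORDER BY ts)      AS pre_id
     , id, ts
     , first_value(post_id) OVER (PARTITION BY grp ORDER BY ts DESC) AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
ORDER  BY ts;
```

On a small or empty table the planner may well prefer a sequential scan plus sort over the index, so the output is only meaningful with realistic data volumes.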

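For best performance with a big table, tune your memory settings, in particular work_mem (used for big sort operations). If you can spare the RAM, consider setting it higher temporarily for your session. More information here and here.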
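Raising work_mem temporarily could look like the following sketch. The value '256MB' is a made-up example, not a recommendation; pick one that fits the RAM you can spare.

```sql
SHOW work_mem;                 -- display the current value

BEGIN;
SET LOCAL work_mem = '256MB';  -- hypothetical value; reverts at COMMIT / ROLLBACK
-- run the window-function query here
COMMIT;

-- Alternatively, change it for the rest of the session:
SET work_mem = '256MB';
-- ... run queries ...
RESET work_mem;                -- back to the configured default
```

SET LOCAL limits the change to the current transaction, which avoids leaving a high setting behind on a pooled connection.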

How?

In subquery sub1 look at the event from the previous row and only keep that if it has changed, thus marking the first element of a new group. At the same time, get the id of the previous and the next row (pre_id, post_id).
In subquery sub2, count() only counts non-null values. The resulting grp marks peers in blocks of consecutive same events.
In the final SELECT, take the first pre_id and the last post_id per group for each row to arrive at the desired result.
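To make the steps above concrete, here is a sketch that stops after sub2 and shows the intermediate columns for the sample data. The event column is added to the select list purely for readability.

```sql
WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT *, count(step) OVER w AS grp   -- running count of non-null steps
FROM  (
   SELECT id, event, ts
        , NULLIF(lag(event) OVER w, event) AS step  -- NULL unless event changed
        , lag(id)  OVER w AS pre_id
        , lead(id) OVER w AS post_id
   FROM   events
   WINDOW w AS (ORDER BY ts)
   ) sub1
WINDOW w AS (ORDER BY ts)
ORDER  BY ts;

-- Result (ts column omitted here for brevity):
--  id | event | step | pre_id | post_id | grp
--   1 |    12 |      |        |       2 |   0
--   2 |    12 |      |      1 |       3 |   0
--   3 |    13 |   12 |      2 |       4 |   1
--   4 |    13 |      |      3 |       5 |   1
--   5 |    12 |   13 |      4 |         |   2
```

Within each grp, the first pre_id and the last post_id are exactly the ids of the previous and next different event.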
Actually, this should be even faster in the outer SELECT:
```
 last_value(post_id) OVER (PARTITION BY grp ORDER BY ts
                           RANGE BETWEEN UNBOUNDED PRECEDING
                                 AND     UNBOUNDED FOLLOWING) AS post_id
```
... since the sort order of the window agrees with the window for pre_id, so only a single sort is needed. A quick test seems to confirm it. More about this frame definition.
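Put together, the single-sort variant might look like the sketch below. Sharing one named window between first_value() and last_value() in the outer SELECT is a choice made here for brevity, not something the answer spells out. With the frame spanning the whole partition, first_value() still returns the value from the first row of each group.

```sql
WITH events(id, event, ts) AS (
  VALUES
   (1, 12, '2014-03-19 08:00:00'::timestamp)
  ,(2, 12, '2014-03-19 08:30:00')
  ,(3, 13, '2014-03-19 09:00:00')
  ,(4, 13, '2014-03-19 09:30:00')
  ,(5, 12, '2014-03-19 10:00:00')
   )
SELECT first_value(pre_id) OVER w AS pre_id
     , id, event
     , last_value(post_id) OVER w AS post_id
FROM  (
   SELECT *, count(step) OVER w AS grp
   FROM  (
      SELECT id, event, ts
           , NULLIF(lag(event) OVER w, event) AS step
           , lag(id)  OVER w AS pre_id
           , lead(id) OVER w AS post_id
      FROM   events
      WINDOW w AS (ORDER BY ts)
      ) sub1
   WINDOW w AS (ORDER BY ts)
   ) sub2
WINDOW w AS (PARTITION BY grp ORDER BY ts
             RANGE BETWEEN UNBOUNDED PRECEDING
                   AND     UNBOUNDED FOLLOWING)  -- frame covers the whole group
ORDER  BY ts;

-- Result:
--  pre_id | id | event | post_id
--         |  1 |    12 |       3
--         |  2 |    12 |       3
--       2 |  3 |    13 |       5
--       2 |  4 |    13 |       5
--       4 |  5 |    12 |
```

Without the explicit frame, last_value() would only see rows up to the current row, because the default frame ends at the current row.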

SQL Fiddle.