Find all intersections of all sets of ranges in PostgreSQL

6
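I'm looking for an efficient way to find all the intersections between sets of timestamp ranges. It needs to work with PostgreSQL 9.2.
Let's say the ranges represent the times when a person is available to meet. Each person may have one or more ranges of time when they are available. I want to find all the time periods during which a meeting can take place (i.e. during which all people are available).
This is what I've got so far. It seems to work, but I don't think it's very efficient, since it considers one person's availability at a time.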
WITH RECURSIVE td AS
(
    -- Test data. Returns:
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00")
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00")
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00")
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 3, '2014-01-20', '2014-04-20'
)
, ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
, min_max AS
(
    SELECT MIN(entity_id), MAX(entity_id)
    FROM td
)
, inter AS
(
    -- Ranges for the lowest ID
    SELECT entity_id AS last_id, the_range
    FROM ranges r
    WHERE r.entity_id = (SELECT min FROM min_max)

    UNION ALL

    -- Iteratively intersect with ranges for the next higher ID
    SELECT entity_id, r.the_range * i.the_range
    FROM ranges r
    JOIN inter i ON r.the_range && i.the_range
    WHERE r.entity_id > i.last_id
        AND NOT EXISTS
        (
            SELECT *
            FROM ranges r2
            WHERE r2.entity_id < r.entity_id AND r2.entity_id > i.last_id
        )
)
-- Take the final set of intersections
SELECT *
FROM inter
WHERE last_id = (SELECT max FROM min_max)
ORDER BY the_range;

3
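Unrelated, but: supplying "static data" can be done more compactly with a values clause instead of select ... union, i.e. values (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp), (2, ...), with the column names defined in the CTE definition. - user330315
3 Answers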

7
I created the tsrange_interception_agg aggregate:
create function tsrange_interception (
    internal_state tsrange, next_data_values tsrange
) returns tsrange as $$
    select internal_state * next_data_values;
$$ language sql;

create aggregate tsrange_interception_agg (tsrange) (
    sfunc = tsrange_interception,
    stype = tsrange,
    initcond = $$[-infinity, infinity]$$
);

Then this query returns the intersections:

with td (id, begin_time, end_time) as
(
    values
    (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp),
    (1, '2014-02-01', '2014-02-28'),
    (1, '2014-04-01', '2014-04-30'),
    (2, '2014-01-15', '2014-02-20'),
    (2, '2014-04-15', '2014-05-05'),
    (3, '2014-01-20', '2014-04-20')
), ranges as (
    select
        id,
        row_number() over(partition by id) as rn,
        tsrange(begin_time, end_time) as tr
    from td
), cr as (
    select r0.tr tr0, r1.tr as tr1
    from ranges r0 cross join ranges r1
    where
        r0.id < r1.id and
        r0.tr && r1.tr and
        r0.id = (select min(id) from td)
)
select tr0 * tsrange_interception_agg(tr1) as interseptions
from cr
group by tr0
having count(*) = (select count(distinct id) from td) - 1
;
                 interseptions                 
-----------------------------------------------
 ["2014-02-01 00:00:00","2014-02-20 00:00:00")
 ["2014-01-20 00:00:00","2014-01-31 00:00:00")
 ["2014-04-15 00:00:00","2014-04-20 00:00:00")
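As a side note on why the aggregate above can start from the unbounded range: intersecting any range with [-infinity, infinity] leaves it unchanged, so the initial state never narrows the result by itself. A minimal sketch, using only the built-in * operator and made-up literal ranges (the column aliases are just for illustration):

```sql
-- Illustration only: the literal ranges are sample values, not real data.
SELECT '[-infinity,infinity]'::tsrange * r           AS unchanged, -- same as r
       r * '[2014-01-15,2014-02-20)'::tsrange        AS narrowed   -- the overlap of the two ranges
FROM (SELECT '[2014-01-01,2014-01-31)'::tsrange AS r) AS s;
```

Note that if one range of the first person overlapped two different ranges of the same other person, the aggregate would intersect those two with each other, so the query as written appears to rely on that not happening.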

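Thanks! I used this idea of an aggregate to solve my original problem, which turned out to be more complicated than this, so I'm marking it as accepted. (I couldn't reduce it to just a set of tsranges, but I still used an aggregate, just a more complicated one.) - EM0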

1
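If you have a fixed number of entities you want to cross-reference, you can use a cross join for each of them and build the intersection with the * operator on ranges.
Using cross joins like this is probably not very efficient, though. The following example is mainly there to explain the more complex example further down.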
WITH td AS
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 4, '2014-01-20', '2014-04-20'
)
,ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
SELECT r1.the_range * r2.the_range * r3.the_range AS r
FROM ranges r1
CROSS JOIN ranges r2
CROSS JOIN ranges r3
WHERE r1.entity_id=1 AND r2.entity_id=2 AND r3.entity_id=4
  AND NOT isempty(r1.the_range * r2.the_range * r3.the_range)
ORDER BY r

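In this case, the multiple cross join is probably less efficient because you don't actually need every possible combination of all the ranges: isempty(r1.the_range * r2.the_range) being true is enough to make isempty(r1.the_range * r2.the_range * r3.the_range) true.
I don't think you can avoid going through each person's availability one at a time, since you want all of them to be able to meet anyway.
What may help is to build the set of intersections incrementally, by cross-joining each person's availability with the previous subset, computed using another recursive CTE (intersections in the example below). You then build up the intersections step by step and discard the empty ranges, with both sides stored as arrays: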
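A quick way to convince yourself of this: once the intersection of two ranges is empty, intersecting it with anything else stays empty. A small sketch with three literal ranges picked from the sample data (the aliases pair_empty and triple_empty are only for illustration):

```sql
-- r1 and r2 do not overlap, so any further intersection is empty as well.
SELECT isempty(r1 * r2)      AS pair_empty,
       isempty(r1 * r2 * r3) AS triple_empty
FROM (SELECT '[2014-01-01,2014-01-31)'::tsrange AS r1,
             '[2014-04-15,2014-05-05)'::tsrange AS r2,
             '[2014-01-20,2014-04-20)'::tsrange AS r3) AS s;
```

Both columns should come back true, which is why a third cross join for such a pair is wasted work.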
WITH RECURSIVE td AS
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 4, '2014-01-20', '2014-04-20'
)
,ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
,ranges_arrays AS (
    -- Prepare an array of all possible intervals per entity
    SELECT entity_id, array_agg(the_range) AS ranges_arr
    FROM ranges
       GROUP BY entity_id
)
,numbered_ranges_arrays AS (
    -- We'll join using pos+1 next, so we want continuous integers
    -- I've changed the example entity_id from 3 to 4 to demonstrate this.
    SELECT ROW_NUMBER() OVER () AS pos, entity_id, ranges_arr
    FROM ranges_arrays
)
,intersections (pos, subranges) AS (
    -- We start off with the infinite range.
    SELECT 0::bigint, ARRAY['[,)'::tsrange]
    UNION ALL
    -- Then, we unnest the previous intermediate result,
    -- cross join it against the array of ranges from the
    -- next row in numbered_ranges_arrays (joined via pos+1).
    -- We take the intersections and remove the empty ranges.
    SELECT r.pos,
           ARRAY(SELECT x * y FROM unnest(r.ranges_arr) x CROSS JOIN unnest(i.subranges) y WHERE NOT isempty(x * y))
    FROM numbered_ranges_arrays r
        INNER JOIN intersections i ON r.pos=i.pos+1
)
,last_intersections AS (
    -- We just really want the result from the last operation (with the max pos).
    SELECT subranges FROM intersections ORDER BY pos DESC LIMIT 1
)
SELECT unnest(subranges) r FROM last_intersections ORDER BY r

Unfortunately, I'm not sure whether this performs any better. You'd probably need a larger dataset to get a meaningful benchmark.
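For what it's worth, one way to get a larger dataset is to generate it. The following is only a sketch: the entity count, the 10-day spacing and the 7-day window length are arbitrary assumptions, chosen so that no entity overlaps itself and every window still shares a few days across all entities.

```sql
-- Hypothetical test data: 50 entities x 100 availability windows each.
-- Each window is 7 days long, starts every 10 days, and is shifted
-- by 0 to 2 days depending on the entity.
SELECT e AS entity_id,
       timestamp '2014-01-01' + (s * 10 + e % 3)     * interval '1 day' AS begin_time,
       timestamp '2014-01-01' + (s * 10 + e % 3 + 7) * interval '1 day' AS end_time
FROM generate_series(1, 50) AS e
CROSS JOIN generate_series(0, 99) AS s;
```

This could replace the body of the td CTE in any of the queries here; real availability data will of course be far less regular, so treat timings on it as a rough indication only.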

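Thanks, this works, although the performance is similar to my original query. The aggregate solution ended up being slightly faster, though. - EM0
@EM Good to know. Just out of curiosity, if you have a dataset at hand to test this with, could you check whether changing WHERE NOT isempty(x * y) to WHERE x && y in the CROSS JOIN in my query improves performance? - Bruno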

0

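OK, I wrote and tested this in T-SQL, but it should run as is, or at least be close enough to translate back, since it only uses pretty vanilla constructs. The possible exception is BETWEEN, but that can be broken into a >= clause and a <= clause. (Thanks @Horse)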

WITH cteSched AS ( --Schedule for everyone
    -- Test data. Returns:
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00")
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00")
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00")
    SELECT 1 AS entity_id, '2014-01-01' AS begin_time, '2014-01-31' AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 3, '2014-01-20', '2014-04-20'
), cteReq as (  --List of people to schedule (or is everyone in Sched required? Not clear, doesn't hurt)
    SELECT 1 as entity_id UNION SELECT 2 UNION SELECT 3
), cteBegins as (
    SELECT distinct begin_time FROM cteSched as T 
    WHERE NOT EXISTS (SELECT entity_id FROM cteReq as R 
                      WHERE NOT EXISTS (SELECT * FROM cteSched as X 
                                        WHERE X.entity_id = R.entity_id 
                                            AND T.begin_time BETWEEN X.begin_time AND X.end_time ))
) SELECT B.begin_time, MIN(S.end_time ) as end_time  
  FROM cteBegins as B cross join cteSched as S 
  WHERE B.begin_time between S.begin_time and S.end_time 
  GROUP BY B.begin_time
-- NOTE: This assumes users do not have schedules that overlap with themselves! That is, nothing like
-- John is available 2014-01-01 to 2014-01-15 and 2014-01-10 to 2014-01-20. 

Edit: adding the output from executing this on SQL Server 2008R2:
begin_time  end_time
2014-01-20  2014-01-31
2014-02-01  2014-02-20
2014-04-15  2014-04-20
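On the BETWEEN point: the predicate is inclusive on both ends, so it is equivalent to a >= and a <= comparison. A small PostgreSQL sketch with made-up dates (the aliases with_between and with_comparisons are only for illustration):

```sql
-- The two boolean columns should agree on every row,
-- including the two boundary dates themselves.
SELECT d,
       d BETWEEN DATE '2014-01-15' AND DATE '2014-02-20'  AS with_between,
       d >= DATE '2014-01-15' AND d <= DATE '2014-02-20'  AS with_comparisons
FROM (VALUES (DATE '2014-01-14'),
             (DATE '2014-01-15'),
             (DATE '2014-02-20'),
             (DATE '2014-02-21')) AS t(d);
```

Because of this, the query above treats end_time as part of the available period, whereas tsrange(begin_time, end_time) in the question is half-open and excludes it.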


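"between" is standard SQL and is supported by every DBMS. - user330315
Sorry, I couldn't get this to work on my original data, but thanks for the alternative approach anyway. - EM0
@E-M, what went wrong? Did you get an error message? Was the output different? - Robert Sheahan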
