将相交的时间段展平

15
我有很多带有给定ID的开始和停止时间的数据,需要将所有相交和相邻的时间段合并为一个组合时间段。下面发布的样本数据都属于同一ID。
为了使事情更清晰,请查看2009年3月6日的样本数据:
以下时间跨度重叠或连续,需要合并成一个时间跨度:
05:54:48 - 10:00:13 09:26:45 - 09:59:40
结果时间跨度将从05:54:48到10:00:13。由于在10:00:13和10:12:50之间存在间隔,因此我们还有以下时间跨度:
10:12:50 - 10:27:25 10:13:12 - 11:14:56 10:27:25 - 10:27:31 10:27:39 - 13:53:38 11:14:56 - 11:15:03 11:15:30 - 14:02:14 13:53:38 - 13:53:43 14:02:14 - 14:02:31
它们会形成一个合并后的时间跨度,从10:12:50到14:02:31,因为它们是相互重叠或相邻的。
下面是我需要的样本数据和展平后的数据。持续时间列只是信息性的。
任何解决方案,无论是SQL还是其他,都可以接受。
编辑:由于有许多不同且有趣的解决方案,我通过添加约束来改进原始问题,以查看最佳(如果有)解决方案是否会浮现:
我通过ODBC从另一个系统获取数据。没有办法为我更改表格布局或添加索引。
该数据仅由日期列进行索引(时间部分不是)。
每天大约有2.5k行。
数据的估计使用模式如下所示:
大多数时间(假设90%),用户只查询一两天(2.5k-5k行)
有时(9%)范围将达到一个月左右(约75k行)
很少(1%)范围将达到一年左右(约900k行)
对于典型情况,查询应快速完成,对于罕见情况,则不应“持续很久”。
查询一年的数据需要约5分钟(无需加入即可选择)
在这些限制条件下,哪种解决方案最好?我担心大多数解决方案将非常缓慢,因为它们基于日期和时间的组合连接,而在我的情况下,这不是索引字段。
您会在客户端还是服务器端执行所有合并操作?您会首先创建一个优化的临时表,并使用其中一种提议的解决方案吗?目前我没有时间测试解决方案,但我会让您知道哪种方法最适合我。
样本数据:
Date       | Start    | Stop
-----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27
02.06.2009 | 10:15:19 | 13:58:24
02.06.2009 | 13:58:24 | 13:58:43
03.06.2009 | 05:54:48 | 10:00:13
03.06.2009 | 09:26:45 | 09:59:40
03.06.2009 | 10:12:50 | 10:27:25
03.06.2009 | 10:13:12 | 11:14:56
03.06.2009 | 10:27:25 | 10:27:31
03.06.2009 | 10:27:39 | 13:53:38
03.06.2009 | 11:14:56 | 11:15:03
03.06.2009 | 11:15:30 | 14:02:14
03.06.2009 | 13:53:38 | 13:53:43
03.06.2009 | 14:02:14 | 14:02:31
04.06.2009 | 05:48:27 | 09:58:59
04.06.2009 | 06:00:00 | 09:59:07
04.06.2009 | 10:15:52 | 13:54:52
04.06.2009 | 10:16:01 | 13:24:20
04.06.2009 | 13:24:20 | 13:24:24
04.06.2009 | 13:24:32 | 14:00:39
04.06.2009 | 13:54:52 | 13:54:58
04.06.2009 | 14:00:39 | 14:00:49
05.06.2009 | 05:53:58 | 09:59:12
05.06.2009 | 10:16:05 | 13:59:08
05.06.2009 | 13:59:08 | 13:59:16
06.06.2009 | 06:04:00 | 10:00:00
06.06.2009 | 10:16:54 | 10:18:40
06.06.2009 | 10:18:40 | 10:18:45
06.06.2009 | 10:23:00 | 13:57:00
06.06.2009 | 10:23:48 | 13:57:54
06.06.2009 | 13:57:21 | 13:57:38
06.06.2009 | 13:57:54 | 13:57:58
07.06.2009 | 21:59:30 | 01:58:49
07.06.2009 | 22:12:16 | 01:58:39
07.06.2009 | 22:12:25 | 01:58:28
08.06.2009 | 02:10:33 | 05:56:11
08.06.2009 | 02:10:43 | 05:56:23
08.06.2009 | 02:10:49 | 05:55:59
08.06.2009 | 05:55:59 | 05:56:01
08.06.2009 | 05:56:11 | 05:56:14
08.06.2009 | 05:56:23 | 05:56:27

展开后的结果:

Date       | Start    | Stop     | Duration
-----------+----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27 | 04:02:59
02.06.2009 | 10:15:19 | 13:58:43 | 03:43:24
03.06.2009 | 05:54:48 | 10:00:13 | 04:05:25
03.06.2009 | 10:12:50 | 14:02:31 | 03:49:41
04.06.2009 | 05:48:27 | 09:59:07 | 04:10:40
04.06.2009 | 10:15:52 | 14:00:49 | 03:44:58
05.06.2009 | 05:53:58 | 09:59:12 | 04:05:14
05.06.2009 | 10:16:05 | 13:59:16 | 03:43:11
06.06.2009 | 06:04:00 | 10:00:00 | 03:56:00
06.06.2009 | 10:16:54 | 10:18:45 | 00:01:51
06.06.2009 | 10:23:00 | 13:57:58 | 03:34:58
07.06.2009 | 21:59:30 | 01:58:49 | 03:59:19
08.06.2009 | 02:10:33 | 05:56:27 | 03:45:54

回答 Ed 最初的问题:不,不会有一个大于 24 小时的单一开始和结束时间。 - VVS
我只是想知道未组合时间跨度通常有多长(为了生成1000k行一年的测试数据)。我应该尝试每天在10小时班次内使用2500个间隔,每个间隔为10-15秒吗?还是班次内的间隔只有几秒钟到几个小时之间?两种情况的运行时间可能会非常不同。 - Not Matt
@Quassnoi:我正在使用SQL Server 2005,数据是通过ODBC数据源检索的,该数据源已集成为链接服务器。由于ODBC驱动程序“不太好”,所以我必须使用openquery()。 - VVS
因此,还必须考虑另一个字段user_id(没有索引)。 - Not Matt
@Matt:嗯,不是的...那个字段已经建立了索引(还有一个客户端ID也建立了索引)。在我看来,问题在于将日期和时间视为一个索引字段(但在我的情况下并不是这样)。 - VVS
显示剩余7条评论
8个回答

8
这里提供了一个仅使用SQL的解决方案。我使用DATETIME列。在我看来,将时间分开存储是一个错误,因为当时间超过午夜时,你会遇到问题。如果需要,你可以调整此解决方案以处理该情况。该解决方案还假定开始和结束时间不为空。如果不是这种情况,你可以根据需要进行调整。
该解决方案的一般思路是获取所有与任何其他时间段不重叠的开始时间,获取所有与任何时间段不重叠的结束时间,然后将它们匹配在一起。
除了一个情况外,结果与你的预期结果相匹配。手动检查后发现,在6日应该有一个时间段在2009-06-06 10:18:45.000结束。
SELECT
     ST.start_time,
     ET.end_time
FROM
(
     SELECT
          T1.start_time
     FROM
          dbo.Test_Time_Spans T1
     LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON
          T2.start_time < T1.start_time AND
          T2.end_time >= T1.start_time
     WHERE
          T2.start_time IS NULL
) AS ST
INNER JOIN
(
     SELECT
          T3.end_time
     FROM
          dbo.Test_Time_Spans T3
     LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON
          T4.end_time > T3.end_time AND
          T4.start_time <= T3.end_time
     WHERE
          T4.start_time IS NULL
) AS ET ON
     ET.end_time > ST.start_time
LEFT OUTER JOIN
(
     SELECT
          T5.end_time
     FROM
          dbo.Test_Time_Spans T5
     LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON
          T6.end_time > T5.end_time AND
          T6.start_time <= T5.end_time
     WHERE
          T6.start_time IS NULL
) AS ET2 ON
     ET2.end_time > ST.start_time AND
     ET2.end_time < ET.end_time
WHERE
     ET2.end_time IS NULL

你说得对,我错过了从10:16:54到10:18:45的时间段。已经相应地更正了预期结果。我通过ODBC从另一个系统获取数据,无法更改底层表格格式。 - VVS
感谢您提供的好解决方案。它对我也起作用,除了当我有两条记录具有完全相同的结束时间时。我将任何NULL结束时间设置为当前时间戳以获得所需的计算结果,但是这会导致此查询出现重复结果。我可以使用DISTINCT,但我想知道是否有一种简单的方法来修改此内容以处理重复的开始/结束时间。此外,这能处理仅有一条记录的情况吗? - trademark

4

MySQL 中:

SELECT  grouper, MIN(start) AS group_start, MAX(end) AS group_end
FROM    (
        SELECT  start,
                end,
                @r := @r + (@edate < start) AS grouper,
                @edate := GREATEST(end, CAST(@edate AS DATETIME))
        FROM    (
                SELECT  @r := 0,
                        @edate := CAST('0000-01-01' AS DATETIME)
                ) vars,
                (
                SELECT  rn_date + INTERVAL TIME_TO_SEC(rn_start) SECOND AS start,
                        rn_date + INTERVAL TIME_TO_SEC(rn_end) SECOND + INTERVAL (rn_start > rn_end) DAY AS end
                FROM    t_ranges
                ) q
        ORDER BY
                start
        ) q
GROUP BY
        grouper
ORDER BY
        group_start

以下是我博客中描述 SQL Server 相关决策的文章:

这里是执行此操作的函数:

DROP FUNCTION fn_spans
GO
CREATE FUNCTION fn_spans(@p_from DATETIME, @p_till DATETIME)
RETURNS @t TABLE
        (
        q_start DATETIME NOT NULL,
        q_end DATETIME NOT NULL
        )
AS
BEGIN
        DECLARE @qs DATETIME
        DECLARE @qe DATETIME
        DECLARE @ms DATETIME
        DECLARE @me DATETIME
        DECLARE cr_span CURSOR FAST_FORWARD
        FOR
        SELECT  s_date + s_start AS q_start,
                s_date + s_stop + CASE WHEN s_start < s_stop THEN 0 ELSE 1 END AS q_end
        FROM    t_span
        WHERE   s_date BETWEEN @p_from - 1 AND @p_till
                AND s_date + s_start >= @p_from
                AND s_date + s_stop <= @p_till
        ORDER BY
                q_start
        OPEN    cr_span
        FETCH   NEXT
        FROM    cr_span
        INTO    @qs, @qe
        SET @ms = @qs
        SET @me = @qe
        WHILE @@FETCH_STATUS = 0
        BEGIN
                FETCH   NEXT
                FROM    cr_span
                INTO    @qs, @qe
                IF @qs > @me
                BEGIN
                        INSERT
                        INTO    @t
                        VALUES (@ms, @me)
                        SET @ms = @qs
                END
                SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END
        END
        IF @ms IS NOT NULL 
        BEGIN
                INSERT
                INTO    @t
                VALUES  (@ms, @me)
        END
        CLOSE   cr_span
        RETURN
END

由于SQL Server缺乏一种简单的方法来引用先前选择的结果集中的行,所以这是一种罕见情况,当在SQL Server中使用游标时比基于集合的决策更快。

1,440,000行上测试,对于完整集合需要24秒,而对于一两天的范围几乎瞬间完成。

请注意SELECT查询中的附加条件:

s_date BETWEEN @p_from - 1 AND @p_till

这似乎是多余的,但它实际上是一个粗略的筛选器,使您在 s_date 上的索引可用。


@David:每当两个相邻的时间跨度不相交时,它会增加分组器,因此所有相交的时间跨度都进入一个组。然后返回每个组的最小和最大日期。 - Quassnoi

3

在SO上有类似的问题:

Min effective and termdate for contiguous dates

值得一提的是,我点赞了建议使用Joe Celko的SQL For Smarties第三版(2005年)的答案,该书讨论了不同的方法,包括集合和过程化方法。


谢谢你的提示,特别是书籍提示 - 这看起来对我来说是必读的 :) - VVS

2

假设您:

  • 有一种简单的自定义日期对象,用于存储开始日期/时间和结束日期/时间
  • 按开始日期/时间排序并作为列表返回行,L,其中包含这些日期
  • 想要创建一个扁平化的日期列表,F

请按照以下步骤操作:

first = first row in L
flat_date.start = first.start, flat_date.end = first.end
For each row in L:
    if row.start < flat_date.end and row.end > flat_date.end: // adding on to a timespan
        flat_date.end = row.end
    else: // ending a timespan and starting a new one
        add flat_date to F
        flat_date.start = row.start, flat_date.end = row.end
add flat_date to F // adding the last timespan to the flattened list

谢谢,看起来很有希望.. 我会尝试一下。 - VVS

1
这里是一个递归CTE解决方案,但我将为每个列分配日期和时间,而不是单独提取日期。这有助于避免一些混乱的特殊情况代码。如果您必须将日期分开存储,我建议使用CTE的视图将其看作两个日期时间列,并采用此方法。
创建测试数据:
create table t1 (d1 datetime, d2 datetime)

insert t1 (d1,d2)
    select           '2009-06-03 10:00:00', '2009-06-03 14:00:00'
    union all select '2009-06-03 13:55:00', '2009-06-03 18:00:00'
    union all select '2009-06-03 17:55:00', '2009-06-03 23:00:00'
    union all select '2009-06-03 22:55:00', '2009-06-04 03:00:00'

    union all select '2009-06-04 03:05:00', '2009-06-04 07:00:00'

    union all select '2009-06-04 07:05:00', '2009-06-04 10:00:00'
    union all select '2009-06-04 09:55:00', '2009-06-04 14:00:00'

递归CTE:

;with dateRanges (ancestorD1, parentD1, d2, iter) as
(
--anchor is first level of collapse
    select
        d1 as ancestorD1,
        d1 as parentD1,
        d2,
        cast(0 as int) as iter
    from t1

--recurse as long as there is another range to fold in
    union all select
        tLeft.ancestorD1,
        tRight.d1 as parentD1,
        tRight.d2,
        iter + 1  as iter
    from dateRanges as tLeft join t1 as tRight
        --join condition is that the t1 row can be consumed by the recursive row
        on tLeft.d2 between tRight.d1 and tRight.d2
            --exclude identical rows
            and not (tLeft.parentD1 = tRight.d1 and tLeft.d2 = tRight.d2)
)
select
    ranges1.*
from dateRanges as ranges1
where not exists (
    select 1
    from dateRanges as ranges2
    where ranges1.ancestorD1 between ranges2.ancestorD1 and ranges2.d2
        and ranges1.d2 between ranges2.ancestorD1 and ranges2.d2
        and ranges2.iter > ranges1.iter
)

输出结果为:
ancestorD1              parentD1                d2                      iter
----------------------- ----------------------- ----------------------- -----------
2009-06-04 03:05:00.000 2009-06-04 03:05:00.000 2009-06-04 07:00:00.000 0
2009-06-04 07:05:00.000 2009-06-04 09:55:00.000 2009-06-04 14:00:00.000 1
2009-06-03 10:00:00.000 2009-06-03 22:55:00.000 2009-06-04 03:00:00.000 3

哇..我需要理解这种方法..给我一点时间 :) - VVS
为什么不使用演示数据,以便更容易进行比较呢? - VVS
我尝试使用原始数据,但并没有按照要求工作。我无法将经过重新处理的代码与示例数据粘贴在评论中,因为它太长了。 :( - Bernhard Hofmann

0
扩展MahlerFive的答案,我编写了一个Swift扩展程序到DateTools。目前为止,它已经通过了我所有的测试。
extension DTTimePeriodCollection {

    func flatten() {

        self.sortByStartAscending()

        guard let periods = self.periods() else { return }
        if periods.count < 1 { return }

        var flattenedPeriods = [DTTimePeriod]()
        let flatdate = DTTimePeriod()

        for period in periods {

            guard let periodStart = period.StartDate, let periodEnd = period.EndDate else { continue }

            if !flatdate.hasStartDate() { flatdate.StartDate = periodStart }
            if !flatdate.hasEndDate() { flatdate.EndDate = periodEnd }

            if periodStart.isEarlierThanOrEqualTo(flatdate.EndDate) && periodEnd.isGreaterThanOrEqualTo(flatdate.EndDate) {

                flatdate.EndDate = periodEnd

            } else {

                flattenedPeriods.append(flatdate.copy())
                flatdate.StartDate = periodStart
                flatdate.EndDate = periodEnd
            }
        }

        flattenedPeriods.append(flatdate.copy())

        // delete all periods
        for var i = 0 ; i < periods.count ; i++ { self.removeTimePeriodAtIndex(0) }

        // add flattened periods to self
        for flat in flattenedPeriods { self.addTimePeriod(flat) }
    }

0
为了帮助回答这个问题,这里提供了问题中给出的示例数据,并使用类似 Hainstech 的表变量进行了处理:
declare @T1 table (d1 datetime, d2 datetime)

insert @T1 (d1,d2)
select           '02 June 2009 05:55:28','02 June 2009 09:58:27'
union all select '02 June 2009 10:15:19','02 June 2009 13:58:24'
union all select '02 June 2009 13:58:24','02 June 2009 13:58:43'
union all select '03 June 2009 05:54:48','03 June 2009 10:00:13'
union all select '03 June 2009 09:26:45','03 June 2009 09:59:40'
union all select '03 June 2009 10:12:50','03 June 2009 10:27:25'
union all select '03 June 2009 10:13:12','03 June 2009 11:14:56'
union all select '03 June 2009 10:27:25','03 June 2009 10:27:31'
union all select '03 June 2009 10:27:39','03 June 2009 13:53:38'
union all select '03 June 2009 11:14:56','03 June 2009 11:15:03'
union all select '03 June 2009 11:15:30','03 June 2009 14:02:14'
union all select '03 June 2009 13:53:38','03 June 2009 13:53:43'
union all select '03 June 2009 14:02:14','03 June 2009 14:02:31'
union all select '04 June 2009 05:48:27','04 June 2009 09:58:59'
union all select '04 June 2009 06:00:00','04 June 2009 09:59:07'
union all select '04 June 2009 10:15:52','04 June 2009 13:54:52'
union all select '04 June 2009 10:16:01','04 June 2009 13:24:20'
union all select '04 June 2009 13:24:20','04 June 2009 13:24:24'
union all select '04 June 2009 13:24:32','04 June 2009 14:00:39'
union all select '04 June 2009 13:54:52','04 June 2009 13:54:58'
union all select '04 June 2009 14:00:39','04 June 2009 14:00:49'
union all select '05 June 2009 05:53:58','05 June 2009 09:59:12'
union all select '05 June 2009 10:16:05','05 June 2009 13:59:08'
union all select '05 June 2009 13:59:08','05 June 2009 13:59:16'
union all select '06 June 2009 06:04:00','06 June 2009 10:00:00'
union all select '06 June 2009 10:16:54','06 June 2009 10:18:40'
union all select '06 June 2009 10:18:40','06 June 2009 10:18:45'
union all select '06 June 2009 10:23:00','06 June 2009 13:57:00'
union all select '06 June 2009 10:23:48','06 June 2009 13:57:54'
union all select '06 June 2009 13:57:21','06 June 2009 13:57:38'
union all select '06 June 2009 13:57:54','06 June 2009 13:57:58'
union all select '07 June 2009 21:59:30','07 June 2009 01:58:49'
union all select '07 June 2009 22:12:16','07 June 2009 01:58:39'
union all select '07 June 2009 22:12:25','07 June 2009 01:58:28'
union all select '08 June 2009 02:10:33','08 June 2009 05:56:11'
union all select '08 June 2009 02:10:43','08 June 2009 05:56:23'
union all select '08 June 2009 02:10:49','08 June 2009 05:55:59'
union all select '08 June 2009 05:55:59','08 June 2009 05:56:01'
union all select '08 June 2009 05:56:11','08 June 2009 05:56:14'
union all select '08 June 2009 05:56:23','08 June 2009 05:56:27'

0

我需要做类似的事情,并且希望找到一个对我有意义的解决方案:

我首先创建一个"最小时间间隔"的表,每个时间间隔是连续时间点之间的间隔(无论这些时间点对是否在同一时间跨度内):

create table Time_Pairs
as
select d1, 
       lead(d1) over (order by d1) as d2
from ( -- all time points
    select d1
    from (select d1 from Test_Time_Spans union all select d2 from Test_Time_Spans)
    group by 1
    )

然后我只保留在单一时间跨度内的配对,并标记较小的时间跨度是否与前一个和后一个时间跨度连续:

create table Non_Overlapping_Time_Spans
as
select s.d1,
       s.d2,
       case
           when s.d1=lag(s.d2) over (order by s.d1) then 1 -- prev time interval is contiguous with this one
           else 0
       end as merge_prev,
       case
           when s.d2=lead(s.d1) over (order by s.d1) then 1 -- next time interval is contiguous with this one
           else 0
       end as merge_next
from Test_Time_Spans t inner join
     Time_Pairs s
on s.d1>=t.d1
and s.d2<=t.d2
group by 1,2

最后,我只保留每个块的开始时间和结束时间,只选择merge_prev或merge_next等于0的时间。
select distinct -- need to de-dupe
       case 
           when merge_prev=0 then d1
           else lag(d1) over (order by d1)
       end as d1,
       case 
           when merge_next=0 then d2
           else lead(d2) over (order by d1)
       end as d2
from Non_Overlapping_Time_Spans
where merge_prev=0 or merge_next=0
order by 1

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接