Merging consecutive date ranges

12

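Using SQL Server 2008 R2, I'm trying to merge date ranges into the maximal date range wherever one end date is immediately followed by the next start date.

The data describes different employments. Some employees may have ended their employment and rejoined at a later time. Those should count as two different employments (example: ID 5). Other people have different types of employment running back to back (enddate and startdate on adjacent days); in that case it should be treated as one employment in total (example: ID 30).

An employment period that has not ended has a NULL enddate.

Some examples are probably enlightening: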

declare @t as table  (employmentid int, startdate datetime, enddate datetime)

insert into @t values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null)

-- expected outcome
EmploymentId StartDate   EndDate
5            2007-12-03  2011-08-26
5            2013-05-02  NULL
30           2006-10-02  NULL
66           2007-09-24  NULL

I've been trying different "gaps and islands" techniques but haven't been able to crack this one.


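Shouldn't a proper overlap be startDate == endDate? Otherwise there are 24 hours that are not counted. - MaxH
Will this be a stored procedure, or are you restricted to a single query? - orrollo
@MaxH: These datetimes are actually used as dates, so the overlap is fine. - Jonas Lincoln
@JonasLincoln: Yes, I understand that, but if you count an employee's days of employment you will get different results. In the example above, employmentid 30 worked 1567 + 573 + 234 = 2374 days (null = 2013-04-04 = today). That differs from the merged summary for employmentid 30 (2376 days from 2006-10-02 to 2013-04-04). You lose 1 day every time the employment type changes. - MaxH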
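The day counts in the last comment can be reproduced with a short script. This is only a sketch in Python, not something from the thread; it assumes, as the comment does, that the open-ended range is treated as ending on 2013-04-04.

```python
from datetime import date

# Employment 30's three back-to-back ranges. The open end is replaced by
# 2013-04-04, the "today" assumed in the comment above.
ranges = [
    (date(2006, 10, 2), date(2011, 1, 16)),
    (date(2011, 1, 17), date(2012, 8, 12)),
    (date(2012, 8, 13), date(2013, 4, 4)),
]

# Day difference of each range on its own (end minus start, like DATEDIFF).
per_range = [(end - start).days for start, end in ranges]
summed = sum(per_range)

# Day difference of the single merged range.
merged = (ranges[-1][1] - ranges[0][0]).days

print(per_range, summed, merged)
```

The two totals differ by one day per boundary between ranges, because the day on which one range ends and the next begins is not counted by either difference.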
4 Answers

16

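The date '32121231' is there only to handle your "no end date" case; it stands in as a very large date. I've assumed you won't have many date ranges per employee, so I've used a simple recursive common table expression (CTE) to merge the ranges.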

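To make it run faster, the anchor query keeps only those ranges that do not link up to a prior range (per employee). The recursive part then just walks the date ranges and extends them. The final GROUP BY keeps only the largest date range built up for each anchor (employmentid, startdate) combination.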


SQL Fiddle

MS SQL Server 2008 Schema Setup:

create table Tbl (
  employmentid int,
  startdate datetime,
  enddate datetime);

insert Tbl values
(5, '2007-12-03', '2011-08-26'),
(5, '2013-05-02', null),
(30, '2006-10-02', '2011-01-16'),
(30, '2011-01-17', '2012-08-12'),
(30, '2012-08-13', null),
(66, '2007-09-24', null);

/*
-- expected outcome
EmploymentId StartDate   EndDate
5            2007-12-03  2011-08-26
5            2013-05-02  NULL
30           2006-10-02  NULL
66           2007-09-24  NULL
*/

Query 1:

;with cte as (
   select a.employmentid, a.startdate, a.enddate
     from Tbl a
left join Tbl b on a.employmentid=b.employmentid and a.startdate-1=b.enddate
    where b.employmentid is null
    union all
   select a.employmentid, a.startdate, b.enddate
     from cte a
     join Tbl b on a.employmentid=b.employmentid and b.startdate-1=a.enddate
)
   select employmentid,
          startdate,
          nullif(max(isnull(enddate,'32121231')),'32121231') enddate
     from cte
 group by employmentid, startdate
 order by employmentid

Results:

| EMPLOYMENTID |                        STARTDATE |                       ENDDATE |
-----------------------------------------------------------------------------------
|            5 |  December, 03 2007 00:00:00+0000 | August, 26 2011 00:00:00+0000 |
|            5 |       May, 02 2013 00:00:00+0000 |                        (null) |
|           30 |   October, 02 2006 00:00:00+0000 |                        (null) |
|           66 | September, 24 2007 00:00:00+0000 |                        (null) |
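The anchor-and-extend logic of the recursive CTE can also be traced outside the database. The following is a minimal Python sketch of the same idea, not the answer's own code; `merge_consecutive` and `ONE_DAY` are names made up for this illustration, and it assumes each (employmentid, startdate) pair is unique and that no range ends before it starts.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# Same sample rows as the schema setup; None plays the role of NULL.
rows = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]

def merge_consecutive(rows):
    # End dates per employment, used to spot ranges that continue an earlier one.
    ends = {(emp, end) for emp, _, end in rows if end is not None}
    # Lookup from (employment, start date) to the end date of that range.
    by_start = {(emp, start): end for emp, start, end in rows}

    merged = []
    for emp, start, end in rows:
        # Anchor test: skip ranges that start the day after another range ends.
        if (emp, start - ONE_DAY) in ends:
            continue
        # Extension step: follow the chain while some range starts the day after `end`.
        while end is not None and (emp, end + ONE_DAY) in by_start:
            end = by_start[(emp, end + ONE_DAY)]
        merged.append((emp, start, end))
    return sorted(merged, key=lambda r: (r[0], r[1]))

result = merge_consecutive(rows)
for row in result:
    print(row)
```

Because the sketch follows each chain straight to its end, it has no need for the large stand-in date or the final GROUP BY; those exist in the SQL because the recursive CTE also emits every intermediate extension.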

1
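Six years later, this is still an excellent solution for small sets of dates. Thanks! - Joe S
In the first projection of the CTE, shouldn't it use b.startdate instead of a.startdate? That is: ;with cte as (select a.employmentid, b.startdate, a.enddate . - nir
@nir No, the first part of the CTE supplies the first record of the group we are building. The second part then extends the end date of that first record. - undefined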

4
An alternative solution that uses window functions rather than a recursive CTE.
SELECT 
    employmentid, 
    MIN(startdate) as startdate, 
    NULLIF(MAX(COALESCE(enddate,'9999-01-01')), '9999-01-01') as enddate
FROM (
    SELECT 
        employmentid, 
        startdate, 
        enddate,
        DATEADD(
            DAY, 
            -COALESCE(
                SUM(DATEDIFF(DAY, startdate, enddate)+1) OVER (PARTITION BY employmentid ORDER BY startdate ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 
                0
            ),
            startdate
    ) as grp
    FROM @t
) withGroup
GROUP BY employmentid, grp
ORDER BY employmentid, startdate

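This works by calculating a grp value that is the same for all consecutive rows. It is computed as follows:

  1. Determine the total number of days each span covers (+1 because both dates are inclusive).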
SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t

  2. Cumulatively sum the days spanned for each employment, ordered by startdate. This gives us the total days spanned by all the previous employment spans.
    • We use COALESCE with 0 to avoid NULLs in our cumulative sum of days spanned.
    • We do not include the current row in our cumulative sum. This is because we will use the value against startdate rather than enddate (we cannot use it against enddate due to the NULLs).
SELECT *, COALESCE(
    SUM(daysSpanned) OVER (
        PARTITION BY employmentid 
        ORDER BY startdate 
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    )
    ,0
)  as cumulativeDaysSpanned
FROM (
    SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
) inner1
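  3. Subtract the cumulative days from startdate to get our grp. This is the crux of the solution.
    • If startdate increases at the same rate as the days spanned, the days are consecutive, and subtracting one from the other gives the same value every time.
    • If startdate increases faster than the days spanned, there is a gap, and we get a new grp value that is greater than the previous one.
    • Although grp is a date, the date itself is meaningless; we use it only as a grouping value.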
SELECT *, DATEADD(DAY, -cumulativeDaysSpanned, startdate) as grp
FROM (
    SELECT *, COALESCE(
        SUM(daysSpanned) OVER (
            PARTITION BY employmentid 
            ORDER BY startdate 
            ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
        )
        ,0
    )  as cumulativeDaysSpanned
    FROM (
        SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
    ) inner1
) inner2

This gives the following results:

+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| employmentid | startdate               | enddate                 | daysSpanned | cumulativeDaysSpanned | grp                     |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 5            | 2007-12-03 00:00:00.000 | 2011-08-26 00:00:00.000 | 1363        | 0                     | 2007-12-03 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 5            | 2013-05-02 00:00:00.000 | NULL                    | NULL        | 1363                  | 2009-08-08 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30           | 2006-10-02 00:00:00.000 | 2011-01-16 00:00:00.000 | 1568        | 0                     | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30           | 2011-01-17 00:00:00.000 | 2012-08-12 00:00:00.000 | 574         | 1568                  | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 30           | 2012-08-13 00:00:00.000 | NULL                    | NULL        | 2142                  | 2006-10-02 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
| 66           | 2007-09-24 00:00:00.000 | NULL                    | NULL        | 0                     | 2007-09-24 00:00:00.000 |
+--------------+-------------------------+-------------------------+-------------+-----------------------+-------------------------+
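  4. Finally, we can GROUP BY grp to collapse the consecutive dates.
    • Use MIN and MAX to get the new startdate and enddate.
    • To handle the NULL enddate, we give it a large value so that MAX picks it, and then convert it back to NULL.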
SELECT 
    employmentid, 
    MIN(startdate) as startdate, 
    NULLIF(MAX(COALESCE(enddate,'9999-01-01')), '9999-01-01') as enddate
FROM (
    SELECT *, DATEADD(DAY, -cumulativeDaysSpanned, startdate) as grp
    FROM (
        SELECT *, COALESCE(
            SUM(daysSpanned) OVER (
                PARTITION BY employmentid 
                ORDER BY startdate 
                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
            )
            ,0
        )  as cumulativeDaysSpanned
        FROM (
            SELECT *, DATEDIFF(DAY, startdate, enddate)+1 as daysSpanned FROM @t
        ) inner1
    ) inner2
) inner3
GROUP BY employmentid, grp
ORDER BY employmentid, startdate

This gives the desired result:
+--------------+-------------------------+-------------------------+
| employmentid | startdate               | enddate                 |
+--------------+-------------------------+-------------------------+
| 5            | 2007-12-03 00:00:00.000 | 2011-08-26 00:00:00.000 |
+--------------+-------------------------+-------------------------+
| 5            | 2013-05-02 00:00:00.000 | NULL                    |
+--------------+-------------------------+-------------------------+
| 30           | 2006-10-02 00:00:00.000 | NULL                    |
+--------------+-------------------------+-------------------------+
| 66           | 2007-09-24 00:00:00.000 | NULL                    |
+--------------+-------------------------+-------------------------+
  5. We can combine the inner queries to get the query at the start of this answer. It is shorter but less self-explanatory.

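Limitations and requirements of all this:

  • The start and end dates of employments must not overlap. Otherwise there could be collisions in grp.
  • startdate must not be NULL. However, this could be overcome by replacing NULL start dates with a small date value.
  • Future developers must be able to understand the windowing black magic you performed.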
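The grp trick can be checked against the intermediate table with a few lines of Python. This is a hedged sketch of the same arithmetic rather than a translation of the query; `with_grp` and `merge_by_grp` are illustrative names, and None stands in for NULL. It may also be useful on servers where the windowed query is unavailable: as far as I know, a ROWS frame in an aggregate's OVER clause needs SQL Server 2012 or later, whereas the question targets 2008 R2.

```python
from datetime import date, timedelta
from itertools import groupby

rows = [
    (5, date(2007, 12, 3), date(2011, 8, 26)),
    (5, date(2013, 5, 2), None),
    (30, date(2006, 10, 2), date(2011, 1, 16)),
    (30, date(2011, 1, 17), date(2012, 8, 12)),
    (30, date(2012, 8, 13), None),
    (66, date(2007, 9, 24), None),
]

def with_grp(rows):
    """Attach grp = startdate minus the days spanned by earlier rows."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for emp, spans in groupby(ordered, key=lambda r: r[0]):
        cumulative = 0  # days spanned by all previous rows of this employment
        for _, start, end in spans:
            out.append((emp, start, end, start - timedelta(days=cumulative)))
            if end is not None:  # an open-ended row adds nothing, like SUM skipping NULL
                cumulative += (end - start).days + 1  # inclusive day count
    return out

def merge_by_grp(rows):
    """Collapse rows that share (employment, grp) into one range."""
    merged = {}
    for emp, start, end, grp in with_grp(rows):
        key = (emp, grp)
        if key not in merged:
            merged[key] = [start, end]
            continue
        first_start, last_end = merged[key]
        merged[key][0] = min(first_start, start)
        # An open end (None) wins over any real date, like the '9999-01-01' stand-in.
        open_ended = end is None or last_end is None
        merged[key][1] = None if open_ended else max(last_end, end)
    result = [(emp, s, e) for (emp, _), (s, e) in merged.items()]
    return sorted(result, key=lambda r: r[:2])

for row in merge_by_grp(rows):
    print(row)
```

The grp values produced by `with_grp` match the grp column of the intermediate table, including the otherwise meaningless 2009-08-08 for the second range of employment 5.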

2
SET NOCOUNT ON

DECLARE @T TABLE(ID INT,FromDate DATETIME, ToDate DATETIME)

INSERT INTO @T(ID,FromDate,ToDate)
SELECT 1,'20090801','20090803' UNION ALL
SELECT 2,'20090802','20090809' UNION ALL
SELECT 3,'20090805','20090806' UNION ALL
SELECT 4,'20090812','20090813' UNION ALL
SELECT 5,'20090811','20090812' UNION ALL
SELECT 6,'20090802','20090802'


SELECT ROW_NUMBER() OVER(ORDER BY s1.FromDate) AS ID,
       s1.FromDate, 
       MIN(t1.ToDate) AS ToDate 
FROM @T s1 
INNER JOIN @T t1 ON s1.FromDate <= t1.ToDate 
  AND NOT EXISTS(SELECT * FROM @T t2 
                 WHERE t1.ToDate >= t2.FromDate
                   AND t1.ToDate < t2.ToDate) 
WHERE NOT EXISTS(SELECT * FROM @T s2 
                 WHERE s1.FromDate > s2.FromDate
                   AND s1.FromDate <= s2.ToDate) 
GROUP BY s1.FromDate 
ORDER BY s1.FromDate

3
Instead of just posting code, try to explain the thought process so that everyone looking for an answer benefits. - Ubercool
1
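The logic seems to be this: once all ranges are merged, the first range in a merged group has a start date that does not fall inside any other range, and the last range in that group has an end date that does not fall inside any other range. The query finds all first ranges (s1) and then the corresponding last range (MIN(t1.ToDate) picks the earliest last range that ends on or after the start of s1). The EXISTS conditions restrict s1 to first ranges and t1 to last ranges. - Victor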
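Victor's reading of the query can be spelled out step by step. The Python below is a sketch of that reading using the answer's six sample ranges; `merge_overlapping` is a name invented for this illustration, and the comparisons mirror the two NOT EXISTS conditions.

```python
from datetime import date

# (FromDate, ToDate) pairs from the answer's sample data.
ranges = [
    (date(2009, 8, 1), date(2009, 8, 3)),
    (date(2009, 8, 2), date(2009, 8, 9)),
    (date(2009, 8, 5), date(2009, 8, 6)),
    (date(2009, 8, 12), date(2009, 8, 13)),
    (date(2009, 8, 11), date(2009, 8, 12)),
    (date(2009, 8, 2), date(2009, 8, 2)),
]

def merge_overlapping(ranges):
    # "First" starts: no other range begins earlier and still covers this start.
    firsts = sorted({s for s, _ in ranges
                     if not any(s2 < s <= e2 for s2, e2 in ranges)})
    # "Last" ends: no other range covers this end and finishes later.
    lasts = sorted({e for _, e in ranges
                    if not any(s2 <= e < e2 for s2, e2 in ranges)})
    # Pair each first start with the earliest last end on or after it,
    # which is what MIN(t1.ToDate) does in the query.
    return [(s, min(e for e in lasts if e >= s)) for s in firsts]

merged = merge_overlapping(ranges)
for start, end in merged:
    print(start, end)
```

Note that this approach merges ranges that overlap or touch on the same day; ranges that are merely adjacent (one ends the day before the next starts) stay separate, unlike in the question's requirement.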

1
A modified script that merges all overlapping periods.
For example:
01.01.2001-01.01.2010
05.05.2005-05.05.2015

will yield a single period:
01.01.2001-05.05.2015

tbl.enddate must always be filled in.

;WITH cte
  AS(
SELECT
  a.employmentid
  ,a.startdate
  ,a.enddate
from tbl a
left join tbl c on a.employmentid=c.employmentid
    and a.startdate > c.startdate
    and a.startdate <= dateadd(day, 1, c.enddate)
WHERE c.employmentid IS NULL

UNION all

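SELECT
  a.employmentid
  ,a.startdate
  ,c.enddate  -- carry the later end date forward so the range actually grows
from cte a
inner join tbl c on a.employmentid=c.employmentid  -- stay within the same employment
    and (c.startdate = dateadd(day, 1, a.enddate) or (c.enddate > a.enddate and c.startdate <= a.enddate))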
)
select distinct employmentid,
          startdate,
          nullif(max(enddate),'31.12.2099') enddate
from cte
group by employmentid, startdate
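To see what the anchor and extension conditions are meant to do, here is a small Python sketch of the same two steps applied to the example periods. It is an illustration under assumptions, not the answer's code: `merge_periods` is a made-up name, employment id 1 is hypothetical, and every period is assumed to have a real end date, as the answer requires.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

# The two example periods, attached to a hypothetical employment id 1.
rows = [
    (1, date(2001, 1, 1), date(2010, 1, 1)),
    (1, date(2005, 5, 5), date(2015, 5, 5)),
]

def merge_periods(rows):
    merged = {}
    for emp, start, end in rows:
        # Anchor test: skip periods whose start is covered by, or directly
        # follows, a period of the same employment that starts earlier.
        if any(e2 == emp and s2 < start <= x2 + ONE_DAY for e2, s2, x2 in rows):
            continue
        # Extension: keep absorbing periods that overlap or directly follow
        # the current end and reach further than it does.
        extended = True
        while extended:
            extended = False
            for e2, s2, x2 in rows:
                if e2 == emp and x2 > end and s2 <= end + ONE_DAY:
                    end, extended = x2, True
        key = (emp, start)
        merged[key] = max(end, merged.get(key, end))
    return sorted((emp, s, e) for (emp, s), e in merged.items())

periods = merge_periods(rows)
print(periods)
```

The loop stops because the end date strictly increases with every extension, which is the same property that keeps the recursive member of the query from running forever.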
