数据集中记录具有连续日期范围,如何获取最小和最大日期的日期范围?

4
我有一个数据集,其包含员工的id、状态和日期范围。下面是给出一个员工详细信息的输入数据集。记录中的日期范围是连续的(按确切顺序),因此第二行的开始日期将是第一行结束日期的下一个日期。 如果一个员工连续几个月请假,则表格会将日期范围分开存储。 例如,在输入集中,员工从'16-10-2016'到'31-12-2016'请病假,并在'1-1-2017'回来。因此,这个项目有3条记录,但日期是连续的。我需要输出时将其作为一个记录显示,如预期的输出数据集所示。 输入
Id  Status   StartDate   EndDate

1   Active   1-9-2007    15-10-2016
1   Sick     16-10-2016  31-10-2016
1   Sick     1-11-2016   30-11-2016
1   Sick     1-12-2016   31-12-2016
1   Active   1-1-2017    4-2-2017  
1   Unpaid   5-2-2017    9-2-2017  
1   Active   10-2-2017   11-2-2017 
1   Unpaid   12-2-2017   28-2-2017 
1   Unpaid   1-3-2017    31-3-2017 
1   Unpaid   1-4-2017    30-4-2017 
1   Active   1-5-2017    13-10-2017
1   Sick     14-10-2017  11-11-2017
1   Active   12-11-2017  NULL   

预期输出

Id   Status    StartDate    EndDate

1    Active    1-9-2007     15-10-2016
1    Sick      16-10-2016   31-12-2016
1    Active    1-1-2017     4-2-2017  
1    Unpaid    5-2-2017     9-2-2017  
1    Active    10-2-2017    11-2-2017 
1    Unpaid    12-2-2017    30-4-2017 
1    Active    1-5-2017     13-10-2017
1    Sick      14-10-2017   11-11-2017
1    Active    12-11-2017   NULL  

我无法在id、status分组时取min(startdate)和max(EndDate),因为如果同一名员工请了另一个病假,那么该结束日期(例如示例中的'11-11-2017')将成为最新的结束日期。

有谁能帮忙写一个SQL Server 2014的查询语句吗?

4个回答

3
我突然意识到这基本上是一个间隙和岛屿问题 - 所以我完全改变了我的解决方案。
为了使该解决方案起作用,日期不必连续。
首先,创建并填充示例表格(在您将来的问题中省略此步骤)。
DECLARE @T AS TABLE
(
    Id int,
    Status varchar(10),
    StartDate date,
    EndDate date
);

SET DATEFORMAT DMY; -- This is needed because how you specified your dates.

INSERT INTO @T (Id, Status, StartDate, EndDate) VALUES
(1, 'Active', '1-9-2007', '15-10-2016'),
(1, 'Sick', '16-10-2016', '31-10-2016'),
(1, 'Sick', '1-11-2016', '30-11-2016'),
(1, 'Sick', '1-12-2016', '31-12-2016'),
(1, 'Active', '1-1-2017', '4-2-2017'),
(1, 'Unpaid', '5-2-2017', '9-2-2017'),
(1, 'Active', '10-2-2017', '11-2-2017'),
(1, 'Unpaid', '12-2-2017', '28-2-2017'),
(1, 'Unpaid', '1-3-2017', '31-3-2017'),
(1, 'Unpaid', '1-4-2017', '30-4-2017'),
(1, 'Active', '1-5-2017', '13-10-2017'),
(1, 'Sick', '14-10-2017', '11-11-2017'),
(1, 'Active', '12-11-2017',  NULL);

新的常用表达式:

;WITH CTE AS
(
SELECT  Id, 
        Status, 
        StartDate, 
        EndDate,
        ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate)
        - ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate) As IslandId,
        ROW_NUMBER() OVER(PARTITION BY Id ORDER BY StartDate DESC)
        - ROW_NUMBER() OVER(PARTITION BY Id, Status ORDER BY StartDate DESC) As ReverseIslandId
FROM @T
)

新查询:

SELECT  DISTINCT Id,
        Status,
        MIN(StartDate) OVER(PARTITION BY IslandId, ReverseIslandId) As StartDate,
        NULLIF(MAX(ISNULL(EndDate, '9999-12-31')) OVER(PARTITION BY IslandId, ReverseIslandId), '9999-12-31') As EndDate

FROM CTE 
ORDER BY StartDate

(新)结果:

Id  Status  StartDate   EndDate
1   Active  01.09.2007  15.10.2016
1   Sick    16.10.2016  31.12.2016
1   Active  01.01.2017  04.02.2017
1   Unpaid  05.02.2017  09.02.2017
1   Active  10.02.2017  11.02.2017
1   Unpaid  12.02.2017  30.04.2017
1   Active  01.05.2017  13.10.2017
1   Sick    14.10.2017  11.11.2017
1   Active  12.11.2017  NULL

您可以在rextester上看到实时演示。

请注意,SQL中日期的字符串表示应符合ISO 8601 - 这意味着使用yyyy-MM-ddyyyyMMdd,因为它是无歧义的,并且始终会被SQL Server正确解释。


感谢您的回答。但是在输出中(第二行),预期的结束日期是2016年12月31日,而不是2016年10月31日。那是病假的最后一天。此外,对于无薪休假(第六行),预期的结束日期是2017年4月30日。 - DJs
好的,我会花几分钟时间来处理并编辑我的答案。 - Zohar Peled

2

这是一个关于分组和窗口的示例。

  • 首先,您需要为每个状态设置一个重置点
  • 使用Sum来设置一个分组
  • 然后获取每个分组的最大/最小日期。
;with x as
(
    select Id, Status, StartDate,  EndDate,
           iif (lag(Status) over (order by Id, StartDate) = Status, null, 1) rst
    from   emp
), y as
 (
    select Id, Status, StartDate, EndDate,
           sum(rst) over (order by Id, StartDate) grp
    from   x
 )
 
 select Id, 
        MIN(Status) as Status, 
        MIN(StartDate) StartDate, 
        MAX(EndDate) EndDate
 from   y
 group by Id, grp
 order by Id, grp

GO
Id | Status | StartDate           | EndDate            
-: | :----- | :------------------ | :------------------
 1 | Active | 01/09/2007 00:00:00 | 15/10/2016 00:00:00
 1 | Sick   | 16/10/2016 00:00:00 | 31/12/2016 00:00:00
 1 | Active | 01/01/2017 00:00:00 | 04/02/2017 00:00:00
 1 | Unpaid | 05/02/2017 00:00:00 | 09/02/2017 00:00:00
 1 | Active | 10/02/2017 00:00:00 | 11/02/2017 00:00:00
 1 | Unpaid | 12/02/2017 00:00:00 | 30/04/2017 00:00:00
 1 | Active | 01/05/2017 00:00:00 | 13/10/2017 00:00:00
 1 | Sick   | 14/10/2017 00:00:00 | 11/11/2017 00:00:00
 1 | Active | 12/11/2017 00:00:00 | null               

dbfiddle here


1
这是一种不使用LAG函数的替代答案。
首先,我需要复制你的测试数据:
DECLARE @table TABLE (Id INT, [Status] VARCHAR(50), StartDate DATE, EndDate DATE);
INSERT INTO @table SELECT 1, 'Active', '20070901', '20161015';
INSERT INTO @table SELECT 1, 'Sick', '20161016', '20161031';
INSERT INTO @table SELECT 1, 'Sick', '20161101', '20161130';
INSERT INTO @table SELECT 1, 'Sick', '20161201', '20161231';
INSERT INTO @table SELECT 1, 'Active', '20170101', '20170204';
INSERT INTO @table SELECT 1, 'Unpaid', '20170205', '20170209';
INSERT INTO @table SELECT 1, 'Active', '20170210', '20170211';
INSERT INTO @table SELECT 1, 'Unpaid', '20170212', '20170228';
INSERT INTO @table SELECT 1, 'Unpaid', '20170301', '20170331';
INSERT INTO @table SELECT 1, 'Unpaid', '20170401', '20170430';
INSERT INTO @table SELECT 1, 'Active', '20170501', '20171013';
INSERT INTO @table SELECT 1, 'Sick', '20171014', '20171111';
INSERT INTO @table SELECT 1, 'Active', '20171112', NULL;

然后查询是:
WITH add_order AS (
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY StartDate) AS order_id
    FROM
        @table),
links AS (
    SELECT
        a1.Id,
        a1.[Status],
        a1.order_id,
        MIN(a1.order_id) AS start_order_id,
        MAX(ISNULL(a2.order_id, a1.order_id)) AS end_order_id,
        MIN(a1.StartDate) AS StartDate,
        MAX(ISNULL(a2.EndDate, a1.EndDate)) AS EndDate
    FROM
        add_order a1
        LEFT JOIN add_order a2 ON a2.Id = a1.Id AND a2.[Status] = a1.[Status] AND a2.order_id = a1.order_id + 1 AND a2.StartDate = DATEADD(DAY, 1, a1.EndDate)
    GROUP BY
        a1.Id,
        a1.[Status],
        a1.order_id),
merged AS (
    SELECT
        l1.Id,
        l1.[Status],
        l1.[StartDate],
        ISNULL(l2.EndDate, l1.EndDate) AS EndDate,
        ROW_NUMBER() OVER (PARTITION BY l1.Id, l1.[Status], ISNULL(l2.EndDate, l1.EndDate) ORDER BY l1.order_id) AS link_id
    FROM
        links l1
        LEFT JOIN links l2 ON l2.order_id = l1.end_order_id)
SELECT
    Id,
    [Status],
    StartDate,
    EndDate
FROM
    merged
WHERE
    link_id = 1
ORDER BY
    StartDate;

结果如下:
Id  Status  StartDate   EndDate
1   Active  2007-09-01  2016-10-15
1   Sick    2016-10-16  2016-12-31
1   Active  2017-01-01  2017-02-04
1   Unpaid  2017-02-05  2017-02-09
1   Active  2017-02-10  2017-02-11
1   Unpaid  2017-02-12  2017-04-30
1   Active  2017-05-01  2017-10-13
1   Sick    2017-10-14  2017-11-11
1   Active  2017-11-12  NULL

首先,我添加一个序列号以帮助合并相邻行。然后确定可以合并的行,添加一个数字来标识每个可合并集合中的第一行,最后从最终CTE中挑选出第一行。请注意,我还必须处理无法合并的行,因此使用了LEFT JOIN和ISNULL语句。
仅供参考,在过滤掉除link_id为1之外的所有行之前,最终CTE的输出如下所示:
Id  Status  StartDate   EndDate link_id
1   Active  2007-09-01  2016-10-15  1
1   Sick    2016-10-16  2016-12-31  1
1   Sick    2016-11-01  2016-12-31  2
1   Sick    2016-12-01  2016-12-31  3
1   Active  2017-01-01  2017-02-04  1
1   Unpaid  2017-02-05  2017-02-09  1
1   Active  2017-02-10  2017-02-11  1
1   Unpaid  2017-02-12  2017-04-30  1
1   Unpaid  2017-03-01  2017-04-30  2
1   Unpaid  2017-04-01  2017-04-30  3
1   Active  2017-05-01  2017-10-13  1
1   Sick    2017-10-14  2017-11-11  1
1   Active  2017-11-12  NULL        1

0
你可以使用lag()lead()函数一起来检查前一个和后一个状态。
WITH CTE AS
(
    select  *, 
            COALESCE(LEAD(status) OVER(ORDER BY (select 1)), '0') Nstatus,
            COALESCE(LAG(status) OVER(ORDER BY (select 1)), '0') Pstatus
   from table
)

SELECT * FROM CTE
WHERE (status <> Nstatus AND status <> Pstatus) OR
      (status <> Pstatus) 

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接