使用ROW_NUMBER() OVER (PARTITION BY ...)遇到麻烦了。

20

我正在使用SQL Server 2008 R2。我有一个名为EmployeeHistory的表,它具有以下结构和示例数据:

EmployeeID Date      DepartmentID SupervisorID
10001      20130101  001          10009
10001      20130909  001          10019
10001      20131201  002          10018
10001      20140501  002          10017
10001      20141001  001          10015
10001      20141201  001          10014

请注意,员工10001在过去改变了2个部门和几位主管。我的目标是按日期字段的顺序列出该员工在每个部门中的就业起始和结束日期。因此,输出将如下所示:

EmployeeID DateStart DateEnd  DepartmentID 
10001      20130101  20131201 001
10001      20131201  20141001 002
10001      20141001  NULL     001

我原本想使用以下查询来对数据进行分区,但失败了。部门从001更改为002,然后又改回001。显然,我无法按DepartmentID进行分区...我确定我忽略了一些明显的东西。有人能帮忙吗?提前感谢。

SELECT * ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID
ORDER BY [Date]) RN FROM EmployeeHistory
3个回答

11

我会这样做:

;WITH x 
 AS (SELECT *, 
            Row_number() 
              OVER( 
                partition BY employeeid 
                ORDER BY datestart) rn 
     FROM   employeehistory) 
SELECT * 
FROM   x x1 
   LEFT OUTER JOIN x x2 
                ON x1.rn = x2.rn + 1 

或者可能是 x2.rn - 1。你需要查看一下。无论如何,你已经有了想法。一旦你将表格自己连接起来,你就可以过滤、分组、排序等等来获得所需的内容。


9
这有点复杂。最简单的方法是参考我为您创建的这个SQL Fiddle,它可以产生精确的结果。你可以通过提高性能或其他考虑来改进它,但希望至少比某些替代方法更清晰。

要点是,首先得到数据的规范排名,然后使用它将数据分成组,然后为每个组找到结束日期,然后消除任何中间行。使用ROW_NUMBER()和CROSS APPLY在可读性上有很大帮助。


编辑2019:

SQL Fiddle似乎实际上出了问题,但看起来这是SQL Fiddle网站的问题。这里是一个完整的版本,在SQL Server 2016上刚刚测试过:

CREATE TABLE Source
(
  EmployeeID int,
  DateStarted date,
  DepartmentID int
)

INSERT INTO Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001)


SELECT *, 
  ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS EntryRank,
  newid() as GroupKey,
  CAST(NULL AS date) AS EndDate
INTO #RankedData
FROM Source
;

UPDATE #RankedData
SET GroupKey = beginDate.GroupKey
FROM #RankedData sup
  CROSS APPLY 
  (
    SELECT TOP 1 GroupKey
    FROM #RankedData sub 
    WHERE sub.EmployeeID = sup.EmployeeID AND
      sub.DepartmentID = sup.DepartmentID AND
      NOT EXISTS 
        (
          SELECT * 
          FROM #RankedData bot 
          WHERE bot.EmployeeID = sup.EmployeeID AND
            bot.EntryRank BETWEEN sub.EntryRank AND sup.EntryRank AND
            bot.DepartmentID <> sup.DepartmentID
        )
      ORDER BY DateStarted ASC
    ) beginDate (GroupKey);

UPDATE #RankedData
SET EndDate = nextGroup.DateStarted
FROM #RankedData sup
  CROSS APPLY 
  (
    SELECT TOP 1 DateStarted
    FROM #RankedData sub
    WHERE sub.EmployeeID = sup.EmployeeID AND
      sub.DepartmentID <> sup.DepartmentID AND
      sub.EntryRank > sup.EntryRank
    ORDER BY EntryRank ASC
  ) nextGroup (DateStarted);

SELECT * FROM 
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY GroupKey ORDER BY EntryRank ASC) AS GroupRank FROM #RankedData
) FinalRanking
WHERE GroupRank = 1
ORDER BY EntryRank;

DROP TABLE #RankedData
DROP TABLE Source

两年多过去了,@coding_idiot,但我已经通过一个完整的脚本来纠正它,可以在您自己的服务器上运行。 - Dominic P
@DominicP,我认为可以用更简单和高效的方法来完成。看看我的答案。 - Vladimir Baranov

2

这似乎是一个常见的间隔和岛屿问题。两个行号序列rn1rn2之间的差异给出了“组”编号。

逐个运行此查询CTE并检查中间结果以了解其工作原理。

样本数据

我稍微扩展了问题中的样本数据。

DECLARE @Source TABLE
(
    EmployeeID int,
    DateStarted date,
    DepartmentID int
)

INSERT INTO @Source
VALUES
(10001,'2013-01-01',001),
(10001,'2013-09-09',001),
(10001,'2013-12-01',002),
(10001,'2014-05-01',002),
(10001,'2014-10-01',001),
(10001,'2014-12-01',001),

(10005,'2013-05-01',001),
(10005,'2013-11-09',001),
(10005,'2013-12-01',002),
(10005,'2014-10-01',001),
(10005,'2016-12-01',001);

SQL Server 2008查询

在SQL Server 2008中没有LEAD函数,因此我不得不使用OUTER APPLY进行自连接,以获取DateEnd的“下一行”值。

WITH
CTE
AS
(
    SELECT
        EmployeeID
        ,DateStarted
        ,DepartmentID
        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
    FROM @Source
)
,CTE_Groups
AS
(
    SELECT
        EmployeeID
        ,MIN(DateStarted) AS DateStart
        ,DepartmentID
    FROM CTE
    GROUP BY
        EmployeeID
        ,DepartmentID
        ,rn1 - rn2
)
SELECT
    CTE_Groups.EmployeeID
    ,CTE_Groups.DepartmentID
    ,CTE_Groups.DateStart
    ,A.DateEnd
FROM
    CTE_Groups
    OUTER APPLY
    (
        SELECT TOP(1) G2.DateStart AS DateEnd
        FROM CTE_Groups AS G2
        WHERE
            G2.EmployeeID = CTE_Groups.EmployeeID
            AND G2.DateStart > CTE_Groups.DateStart
        ORDER BY G2.DateStart
    ) AS A
ORDER BY
    EmployeeID
    ,DateStart
;

查询SQL Server 2012+

从SQL Server 2012开始,有一个LEAD函数可以使这个任务更加高效。

WITH
CTE
AS
(
    SELECT
        EmployeeID
        ,DateStarted
        ,DepartmentID
        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY DateStarted) AS rn1
        ,ROW_NUMBER() OVER (PARTITION BY EmployeeID, DepartmentID ORDER BY DateStarted) AS rn2
    FROM @Source
)
,CTE_Groups
AS
(
    SELECT
        EmployeeID
        ,MIN(DateStarted) AS DateStart
        ,DepartmentID
    FROM CTE
    GROUP BY
        EmployeeID
        ,DepartmentID
        ,rn1 - rn2
)
SELECT
    CTE_Groups.EmployeeID
    ,CTE_Groups.DepartmentID
    ,CTE_Groups.DateStart
    ,LEAD(CTE_Groups.DateStart) OVER (PARTITION BY CTE_Groups.EmployeeID ORDER BY CTE_Groups.DateStart) AS DateEnd
FROM
    CTE_Groups
ORDER BY
    EmployeeID
    ,DateStart
;

结果

+------------+--------------+------------+------------+
| EmployeeID | DepartmentID | DateStart  |  DateEnd   |
+------------+--------------+------------+------------+
|      10001 |            1 | 2013-01-01 | 2013-12-01 |
|      10001 |            2 | 2013-12-01 | 2014-10-01 |
|      10001 |            1 | 2014-10-01 | NULL       |
|      10005 |            1 | 2013-05-01 | 2013-12-01 |
|      10005 |            2 | 2013-12-01 | 2014-10-01 |
|      10005 |            1 | 2014-10-01 | NULL       |
+------------+--------------+------------+------------+

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接