SQL group by partition

3

I have the following table:

ID  Rating  Rating_from Rating_to
1   2       2010-01-01  2011-01-01
1   2       2011-01-02  2012-02-01
1   3       2012-02-02  2013-03-01
1   2       2013-03-02  2013-04-01
1   2       2013-04-02  9999-12-31

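It holds one rating per ID, which is reviewed under certain circumstances. Each time a rating is reviewed, the latest row is given a Rating_to date (usually the day before the new rating), and a new row is inserted whose Rating_from is the date of the review. The new row's Rating_to is set to 9999-12-31 rather than NULL. Most of the time the rating stays the same; occasionally it changes. Over time, an ID may return to a rating it held before.
How can I get the earliest Rating_from and the latest Rating_to for each rating of each ID, without merging periods that share a rating value but are separated by a different rating?
I am trying to get the following table: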
ID  Rating  Rating_from Rating_to
1   2       2010-01-01  2012-02-01
1   3       2012-02-02  2013-03-01
1   2       2013-03-02  NULL

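With the data above, I tried grouping by ID and Rating (with MIN() and MAX() on the from and to columns), but I only get two rows, one for rating 2 and one for rating 3, even though there are two separate periods with rating 2.
A colleague suggested LAG() and LEAD(), but I am not sure how they would help here. The data is in SQL Server 2017 and covers about a million IDs. Any suggestions are very welcome.
I have added a CREATE TABLE script below with real table data, in case it helps: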
CREATE TABLE tbl(
  id INT,
  rating int,
  rating_from DATE,
  rating_to DATE
);

INSERT INTO tbl VALUES
  (1, 2, '2014-05-23', '2015-04-13'),
  (1, 2, '2015-04-14', '2015-06-02'),
  (1, 2, '2015-06-03', '2016-05-31'),
  (1, 2, '2016-06-01', '2018-03-22'),
  (2, 1, '2016-06-01', '9999-12-31'),
  (3, 3, '2016-06-01', '9999-12-31'),
  (1, 2, '2018-03-23', '2018-08-06'),
  (1, 3, '2018-08-07', '2018-08-21'),
  (1, 2, '2018-08-22', '2018-09-19'),
  (1, 2, '2018-09-20', '9999-12-31');
4 Answers

1
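This is a solution based on Itzik Ben-Gan's islands approach. It first flags the rows where a change occurs, then takes a running total of those flags so that every run of unchanged ratings gets its own island ID, and finally groups by that ID. It is a fast and elegant method.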
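WITH LagAndLead AS
(
    -- Flag each row whose rating differs from the previous row of the same ID.
    SELECT ID, Rating, Rating_from, Rating_to,
           CASE WHEN LAG(Rating) OVER (PARTITION BY ID ORDER BY Rating_from) <> Rating
                THEN 1
                ELSE 0   -- also covers the first row per ID, where LAG() is NULL
           END AS IsStart
    FROM tbl
),
Islands AS
(
    -- A running total of the flags numbers each run of identical ratings.
    SELECT ID, Rating, Rating_from, Rating_to,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY Rating_from ROWS UNBOUNDED PRECEDING) AS IslandID
    FROM LagAndLead
)
-- Collapse each island into one row.
SELECT S.ID, MIN(S.Rating) AS Rating, MIN(S.Rating_from) AS Rating_from, MAX(S.Rating_to) AS Rating_to
FROM Islands AS S
GROUP BY S.ID, S.IslandID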

Example: dbfiddle.uk
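
The desired output in the question shows NULL for the open-ended period, while the data stores 9999-12-31. A minimal sketch of one way to bridge that, assuming 9999-12-31 is only ever used as the "still current" sentinel: wrap the final MAX() in NULLIF(). The CTE name Flagged is just an illustrative choice; everything else follows the query above.

```sql
-- Sketch: same islands logic, with the 9999-12-31 sentinel reported as NULL.
WITH Flagged AS
(
    SELECT ID, Rating, Rating_from, Rating_to,
           CASE WHEN LAG(Rating) OVER (PARTITION BY ID ORDER BY Rating_from) <> Rating
                THEN 1
                ELSE 0
           END AS IsStart
    FROM tbl
),
Islands AS
(
    SELECT ID, Rating, Rating_from, Rating_to,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY Rating_from
                              ROWS UNBOUNDED PRECEDING) AS IslandID
    FROM Flagged
)
SELECT ID,
       MIN(Rating)      AS Rating,
       MIN(Rating_from) AS Rating_from,
       -- Assumption: 9999-12-31 always means "no end date yet".
       NULLIF(MAX(Rating_to), CAST('9999-12-31' AS DATE)) AS Rating_to
FROM Islands
GROUP BY ID, IslandID
ORDER BY ID, MIN(Rating_from);
```

Doing the conversion after aggregation matters: MAX() ignores NULLs, so converting the sentinel to NULL before grouping would make an open island report the end date of its second-to-last row.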

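This is an elegant solution, thank you. It works well. Importantly, you don't have to worry about whether the dates line up exactly; you only look for changes in the rating value. - dearsina
If needed, it can be adapted to account for gaps between rows, but you would need an extra LAG to get the previous end date and more logic in the IsStart flag. - BarneyL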
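
A rough sketch of the gap-aware adjustment described in the comment above. It is not taken from the original answer; it assumes a "gap" means the previous row's Rating_to is more than one day before the current Rating_from, and the CTE name Flagged is illustrative.

```sql
-- Sketch: start a new island on a rating change OR on a date gap.
WITH Flagged AS
(
    SELECT ID, Rating, Rating_from, Rating_to,
           CASE
               -- The rating differs from the previous row of the same ID.
               WHEN LAG(Rating) OVER (PARTITION BY ID ORDER BY Rating_from) <> Rating
                   THEN 1
               -- The previous period ended more than one day before this one starts.
               WHEN LAG(Rating_to) OVER (PARTITION BY ID ORDER BY Rating_from)
                    < DATEADD(DAY, -1, Rating_from)
                   THEN 1
               ELSE 0   -- first row per ID (LAG is NULL) or a seamless continuation
           END AS IsStart
    FROM tbl
),
Islands AS
(
    SELECT ID, Rating, Rating_from, Rating_to,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY Rating_from
                              ROWS UNBOUNDED PRECEDING) AS IslandID
    FROM Flagged
)
SELECT ID,
       MIN(Rating)      AS Rating,
       MIN(Rating_from) AS Rating_from,
       MAX(Rating_to)   AS Rating_to
FROM Islands
GROUP BY ID, IslandID
ORDER BY ID, MIN(Rating_from);
```

The comparison subtracts a day from Rating_from rather than adding a day to the previous Rating_to. This avoids a date overflow if a 9999-12-31 row is ever followed by another row for the same ID.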

1
A solution that I find convenient is similar to the lag() approach. Rather than using lag(), it looks for the maximum rating_to seen so far:
select id, rating, min(rating_from), max(rating_to)
from (select t.*,
             sum(case when dateadd(day, 1, prev_rating_to) >= rating_from then 0 else 1 end) over
                 (partition by id, rating order by rating_from) as grp
      from (select t.*,
                   max(rating_to) over (partition by id, rating
                                        order by rating_from
                                        rows between unbounded preceding and 1 preceding
                                       ) as prev_rating_to
            from tbl t
           ) t
     ) t
group by id, rating, grp
order by id, rating, min(rating_from);

This approach also takes rating_to into account, so it finds gaps even when the rating does not change.

Here is a db<>fiddle.
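
To see the gap handling in action, here is a small sketch that runs the same query against a made-up table variable. The name @demo and its two rows are hypothetical, not part of the question's data; the rating stays at 2 but there is a one-month hole between the periods.

```sql
-- Hypothetical data: same rating throughout, but February 2020 is missing.
DECLARE @demo TABLE
(
    id          INT,
    rating      INT,
    rating_from DATE,
    rating_to   DATE
);

INSERT INTO @demo (id, rating, rating_from, rating_to) VALUES
    (1, 2, '2020-01-01', '2020-01-31'),
    (1, 2, '2020-03-01', '9999-12-31');

SELECT id, rating, MIN(rating_from) AS rating_from, MAX(rating_to) AS rating_to
FROM (SELECT t.*,
             -- A row starts a new group unless it touches or overlaps the
             -- latest end date seen so far for this id and rating.
             SUM(CASE WHEN DATEADD(DAY, 1, prev_rating_to) >= rating_from
                      THEN 0 ELSE 1 END)
                 OVER (PARTITION BY id, rating ORDER BY rating_from) AS grp
      FROM (SELECT t.*,
                   MAX(rating_to) OVER (PARTITION BY id, rating
                                        ORDER BY rating_from
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                                       ) AS prev_rating_to
            FROM @demo t
           ) t
     ) t
GROUP BY id, rating, grp
ORDER BY id, rating, MIN(rating_from);
```

With this data the two periods come back as separate rows, because 2020-02-01 (the day after the first period ends) is earlier than the second period's start. A query that only watches for changes in the rating value would merge them into one row.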


0

I just tried an example and would like to share it. If the result looks right to you, feel free to use it.

declare @temp as table 
(
    id int,
    rating int,
    rating_from date,
    rating_to date null
);

insert into @temp (id,rating,rating_from,rating_to)values
(1,2,'2010-01-01','2011-01-01'),
(1,2,'2011-01-02','2012-02-01'),
(1,3,'2012-02-02','2013-03-01'),
(1,2,'2013-03-02','2011-01-01'),
(1,2,'2013-04-02',null);

select id,rating,min(rating_from) rating_from,max(Rating_to) rating_to from @temp
group by id,rating
union 
select id,rating,max(rating_from) rating_from,max(Rating_to) rating_to from @temp
where Rating_to is null
group by id,rating
order by rating_from,rating_to


id  rating  rating_from rating_to
1   2   2010-01-01  2012-02-01
1   3   2012-02-02  2013-03-01
1   2   2013-04-02  NULL

Hope this helps...


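This relies entirely on one of the periods with the same rating ending in NULL. It does not work with more split periods: for example, if the rating went back to 3 again, the two separate periods with rating 2 would be merged. - BarneyL
They won't be merged. You may not have noticed that after the UNION I take max(rating_from), so this works as expected. - Mohideen Asraf
Unfortunately, this does not work beyond the exact example I originally included. See a slightly different dataset with the same logic here: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=2bf6f33adc17b199ea0a821a157c51f2 - dearsina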

0

Please try the following query and see whether it gives the correct result with better performance:

SELECT * FROM
(
SELECT 
  ID, Rating, 
  MIN(Rating_from) AS Rating_from, 
  MAX(Rating_to) AS Rating_to
FROM (

SELECT 
  ID, Rating, Rating_from, Rating_to
  ,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Rating_from, ID) R_NUM 
  ,ROW_NUMBER() OVER(PARTITION BY Rating, ID ORDER BY Rating_from, ID) R_NUM_Rating
FROM TEST
) AS A
WHERE A.R_NUM = A.R_NUM_Rating OR A.R_NUM_Rating = 1
GROUP BY ID, Rating

UNION ALL

SELECT 
  ID, Rating, 
  MIN(Rating_from) AS Rating_from, 
  MAX(Rating_to) AS Rating_to
FROM (

SELECT 
  ID, Rating, Rating_from, Rating_to
  ,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Rating_from, ID) R_NUM 
  ,ROW_NUMBER() OVER(PARTITION BY Rating, ID ORDER BY Rating_from, ID) R_NUM_Rating
FROM TEST
) AS A
WHERE A.R_NUM <> A.R_NUM_Rating AND A.R_NUM_Rating <> 1
GROUP BY ID, Rating
) AS FINAL
ORDER BY 3, 1

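In my dataset the "end" value is not actually NULL but 9999-12-31. How does your code handle that? - dearsina
I worked on this script before the question was edited. Let me take a look... - VB_isYoung
I have added a create table script with "real" data; perhaps that helps. - dearsina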
