How to group similar rows in SQL Server

4

I have a table like this:

Date        ConfigID    ItemID    ClientName    Metric1    Metric2
====        ========    ======    ==========    =======    =======
2017-01-01  1           1         A             2.0        2.0
2017-01-01  3           1         A             2.0        2.0
2017-01-01  4           2         B             5.0        5.0
2017-01-02  4           3         A             6.0        6.0
2017-01-01  2           1         A             2.0        2.0
....
(20 million rows here)

I want to detect duplicates here based on Date, ItemID, ClientName, Metric1, and Metric2, so I wrote the following code:

CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);    

WITH Dupes          
AS (            
    SELECT *        
        ,ROW_NUMBER() OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            ORDER BY [Date] DESC
    ) AS RowNum 
    FROM myTable)

SELECT *
FROM Dupes

But this returns something like this:
Date        ConfigID    ItemID    ClientName    Metric1    Metric2    RowNum
====        ========    ======    ==========    =======    =======    ======
2017-01-01  1           1         A             2.0        2.0        1
2017-01-01  3           1         A             2.0        2.0        2
2017-01-01  4           2         B             5.0        5.0        1
2017-01-02  4           3         A             6.0        6.0        1
2017-01-01  2           1         A             2.0        2.0        3
....
(20 million rows here)

I want the similar items grouped together according to the PARTITION BY clause. In other words, I would like to see something like this (I don't need RowNum):

Date        ConfigID    ItemID    ClientName    Metric1    Metric2    RowNum
====        ========    ======    ==========    =======    =======    ======
2017-01-01  1           1         A             2.0        2.0        1
2017-01-01  3           1         A             2.0        2.0        2
2017-01-01  2           1         A             2.0        2.0        3
2017-01-01  4           2         B             5.0        5.0        1
2017-01-02  4           3         A             6.0        6.0        1
....
(20 million rows here)

Which SQL queries would help me group the duplicate/similar rows in the table? Thanks for any suggestions and answers!


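What do you mean by "group" them? Do you want them left out of the returned dataset? Aggregated? Something else? - Thom A
@Larnu, I mean I want to review them manually and see where they differ (for example, I can see how many different ConfigIDs have the same values in the other columns). - user1330974
4 Answers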

1

Just add an ORDER BY to the select statement.

;           
WITH Dupes          
AS (            
    SELECT *        
        ,ROW_NUMBER() OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            ORDER BY [Date] DESC
    ) AS RowNum 
    FROM myTable)

SELECT *
FROM Dupes
order by [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2],
RowNum 

1

I think you just need an ORDER BY. And the CTE is unnecessary:

. . .
SELECT *
FROM Dupes
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2];
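
One way to read "the CTE is unnecessary" is that the window function can sit directly in the outer SELECT, with the sort applied to the same statement. A minimal sketch under that assumption, using the MyTable definition from the question (this is an interpretation, not necessarily the answerer's exact query; ordering the window by ConfigID is a choice made here):

```sql
-- Assumption: MyTable is the sample table from the question.
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]
           ORDER BY [ConfigID]   -- number the rows inside each group by ConfigID
       ) AS RowNum
FROM MyTable
-- Sorting by the partition columns keeps each group's rows next to each other.
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2], RowNum;
```

The window is ordered by ConfigID rather than [Date] because [Date] is already a partition column: every row in a group ties on it, so the numbering within a group would otherwise be arbitrary.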

1
Would using DENSE_RANK instead of ROW_NUMBER help?
;           
WITH Dupes          
AS (            
    SELECT *        
        ,DENSE_RANK ( )
        OVER (    
            ORDER BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
             DESC
    ) AS GroupID 
    FROM myTable)

SELECT *
FROM Dupes

Here is the proposed solution:
;           
WITH D1          
AS (            
    SELECT *        
        ,DENSE_RANK ( )
        OVER (    
            ORDER BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
             DESC
    ) AS GroupID 
    FROM myTable)
, Dupes AS (
    SELECT *
        , COUNT(*) OVER (PARTITION BY GroupID) AS GroupItemsCount
    FROM D1
)
SELECT *
FROM Dupes
WHERE GroupItemsCount <> 1

But a better approach might be:
;           
WITH Dupes          
AS (            
    SELECT *        
        ,COUNT(*)
        OVER (    
            partition BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
    ) AS GroupItemsCount
    FROM myTable)

SELECT *
FROM Dupes
WHERE GroupItemsCount > 1
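
Since the asker's stated goal in the question comments is to see how many different ConfigID values share the same remaining columns, a per-group summary may be useful next to the row listing. A sketch using plain GROUP BY, where the aliases RowsInGroup and DistinctConfigIDs are names introduced here for illustration:

```sql
-- Assumption: MyTable is the sample table from the question.
-- Returns one row per duplicate group, with counts instead of the individual rows.
SELECT [Date], [ItemID], [ClientName], [Metric1], [Metric2],
       COUNT(*)                   AS RowsInGroup,
       COUNT(DISTINCT [ConfigID]) AS DistinctConfigIDs
FROM MyTable
GROUP BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]
HAVING COUNT(*) > 1              -- keep only groups with more than one row
ORDER BY RowsInGroup DESC;
```

On the five sample rows this should return a single group (2017-01-01, ItemID 1, client A) with both counts equal to 3. On the full table, the summary can point to the groups worth inspecting row by row with the windowed queries above.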

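Yes! DENSE_RANK is close to what I want. But how do I extract, from the query above, only the ranks that have multiple rows? For example, say three rows have rank "1" and the remaining rows are each assigned a unique rank (2...n). Then I only want to extract/view the rows with rank "1". Which additional queries would help? Thanks! - user1330974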
1
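I'm not at a computer right now, but I suggest you try COUNT(*) OVER (PARTITION BY GroupID) AS GroupItemsCount in another CTE and keep only the rows where GroupItemsCount > 1. - Ezin82
Yes! COUNT(*) OVER (PARTITION BY ...) with a filter on count > 1, plus sorting by all the columns in the PARTITION BY clause, works fine! Thank you very much! - user1330974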

1

Following Luca's suggestion in the comments, using COUNT(*) OVER (PARTITION BY ...) seems to do the trick:

CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);    

WITH Dupes          
AS (            
    SELECT *        
        ,COUNT(*) OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            -- ORDER BY omitted: without it, COUNT(*) covers the whole partition
    ) AS DupeCount 
    FROM myTable)

SELECT *
FROM Dupes
WHERE DupeCount > 1

Agreed, @Forty3. It works. Thanks for writing up the answer. I found that adding "ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]" after "WHERE DupeCount > 1" makes the duplicates easy to review. :) - user1330974
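
The combination described in the comment above, filtering on the windowed count and then sorting by the partition columns, might look like the following sketch. It assumes the MyTable sample from the question; adding ConfigID as the last sort key is an extra choice made here so that rows within a group appear in a stable order.

```sql
-- Assumption: MyTable is the sample table from the question.
WITH Dupes AS (
    SELECT *,
           COUNT(*) OVER (
               PARTITION BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]
           ) AS DupeCount        -- number of rows sharing these five values
    FROM MyTable
)
SELECT *
FROM Dupes
WHERE DupeCount > 1              -- drop rows that have no duplicate
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2], [ConfigID];
```

If this runs as part of a larger batch, the statement before WITH needs a terminating semicolon, which is why some of the answers above start with a bare `;`.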
