Remove duplicates - delete all but the row with the most recent date

Possible duplicate:
How to find duplicate entries in SQL and delete the oldest entries?

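Because of a faulty update tool, my database has several thousand duplicates. I am able to identify the sets of items that have duplicates, but I need to delete only the oldest entries, which are not necessarily the ones with the lowest id. Test data follows; the rows that should be kept are marked with an asterisk (*).

Articles with duplicate titles and duplicate ruleids should be deleted, except that the most recently created row in each set should be kept. (The actual id column is a GUID, so I cannot assume auto-increment.)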

Id           Article id          Rule Id         Title          Opened Date
--           ----------          -------         -----          -----------
1*           111                 5               T1             2013-01-20
2            112                 5               T1             2012-07-01
3*           113                 6               T2             2012-07-01
4*           114                 7               T2             2012-07-02
5            115                 8               T3             2012-07-01
6            116                 8               T3             2013-01-20
7*           117                 8               T3             2013-01-21           

Table structure:
CREATE TABLE [dbo].[test_ai](
    [id] [int] NOT NULL,
    [ArticleId] [varchar](50) NOT NULL,
    [ruleid] [varchar](50) NULL,
    [Title] [nvarchar](max) NULL,
    [AuditData_WhenCreated] [datetime] NULL,
PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

Test data inserts:

insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (1, 111, 5, 'test 1', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (2, 112, 5, 'test 1', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (3, 113, 6, 'test 2', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (4, 114, 7, 'test 2', '2012-07-02')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (5, 115, 8, 'test 3', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (6, 116, 8, 'test 3', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (7, 117, 8, 'test 3', '2013-01-21')

My current query looks like this:

select * from test_ai
where test_ai.id in

-- set 1 - all rows with duplicates
(select f.id 
from test_ai as F 
WHERE exists (select ruleid, title, count(id)   
FROM test_ai
    WHERE test_ai.title = F.title
        AND test_ai.ruleid = F.ruleid
    GROUP BY test_ai.title, test_ai.ruleid
    having count(test_ai.id) > 1))
    and test_ai.id not in

-- set 2 - includes one row from each set of duplicates
(select min(id)
from test_ai as F
WHERE EXISTS (select ruleid, title, count(id)
from test_ai
WHERE test_ai.title = F.title 
    AND test_ai.ruleid = F.ruleid
group by test_ai.title, test_ai.ruleid
HAVING count(test_ai.id) > 1)   
GROUP BY title, ruleid
)   

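This query identifies a set of rows to delete (rows 2, 6, and 7), but it keeps the row with the lowest id in each group rather than the one with the most recent "Opened Date" (it should delete rows 2, 5, and 6). I realize I have not expressed that requirement in the statement, but I don't know how to add it. It is not a problem if the script has to be run several times to remove all the duplicates.
The actual problem is considerably more complex, but if I can get past this key part I can make progress again. Thanks for your help!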

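Test Fiddle - mellamokb
I think this will help you: http://jzinedine.me/post/30604785957/a-flexible-way-to-delete-duplicate-rows-in-sql - Jahan Zinedine
Based on the rows you describe wanting to delete, shouldn't the title be "keep only the newest row" or "delete all but the newest row"? As it stands, the title doesn't match your actual requirement. - Aaron Bertrand
@AaronBertrand Agreed, adjusted. Thanks. - Chris Ballance
1 Answer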

In SQL Server 2005+, the typical pattern for deleting a row from a set (or one row from each group in a set) is:

;WITH cte AS 
(
  SELECT col, rn = ROW_NUMBER() OVER 
    (PARTITION BY something ORDER BY something)
  FROM dbo.base_table
  WHERE ...
)
DELETE cte WHERE rn = 1;

In your case, it would look like this:
;WITH cte AS 
(
  SELECT id, ruleid, Title, rn = ROW_NUMBER() OVER 
  (
     PARTITION BY ruleid, Title  
     ORDER BY auditdata_whencreated DESC
  )
  FROM dbo.test_ai
)
DELETE cte 
  OUTPUT deleted.id
  WHERE rn > 1;

Results:

id
----
2
6
5
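
Before running a delete like this against real data, it may be worth previewing the affected rows first. The following is only a sketch against the dbo.test_ai table from the question. The extra id column in the ORDER BY is my own assumption, added as a tie-breaker so that the surviving row is deterministic when two rows in a group share the same date; the ROLLBACK is a placeholder that you would change to COMMIT once the output looks right.

```sql
-- Step 1: preview the rows that would be deleted (nothing is modified).
;WITH cte AS
(
  SELECT id, ruleid, Title, AuditData_WhenCreated,
    rn = ROW_NUMBER() OVER
    (
      PARTITION BY ruleid, Title
      -- id is an assumed tie-breaker, not part of the original requirement
      ORDER BY AuditData_WhenCreated DESC, id DESC
    )
  FROM dbo.test_ai
)
SELECT id, ruleid, Title, AuditData_WhenCreated
FROM cte
WHERE rn > 1;  -- with the test inserts from the question: ids 2, 5 and 6

-- Step 2: run the delete inside a transaction so it can be undone.
BEGIN TRANSACTION;

;WITH cte AS
(
  SELECT id,
    rn = ROW_NUMBER() OVER
    (
      PARTITION BY ruleid, Title
      ORDER BY AuditData_WhenCreated DESC, id DESC
    )
  FROM dbo.test_ai
)
DELETE cte
  OUTPUT deleted.id
  WHERE rn > 1;

-- Sanity check: no (ruleid, Title) pair should occur more than once now,
-- so this query should return no rows.
SELECT ruleid, Title, COUNT(*) AS cnt
FROM dbo.test_ai
GROUP BY ruleid, Title
HAVING COUNT(*) > 1;

ROLLBACK TRANSACTION;  -- change to COMMIT TRANSACTION when satisfied
```

Because rn = 1 always marks the newest row in each (ruleid, Title) group, a single pass removes every older duplicate, so the script should not need to be run more than once.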

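@mellamokb But I think this answer is simpler, just like the one I mentioned in my comment on the question. - Jahan Zinedine
@Jani: Aren't our answers exactly the same idea? - mellamokb
@mellamokb Hey man! You updated that snippet in the meantime :-D - Jahan Zinedine
I also found a similar question and am using the example at https://dev59.com/80bRa4cB1Zd3GeqP5f2N as a basis. - Chris Ballance
If your fiddle is the same as what I posted, then what was the first conversation about? "Me too! Me too!" - Aaron Bertrand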
