MySQL: delete duplicate records but keep the latest one

91

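I have a unique id field and an email field. Emails get duplicated. I only want to keep the latest of each duplicated email address, i.e. the row with the last inserted id.

How can I achieve this?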


1
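Did you try searching? Possible duplicate of how-do-i-delete-duplicate-rows-and-keep-the-first-row - sra
1
Why not just prevent duplicates from being inserted into the table in the first place? Make email a unique index. - tofutim
@sra, man, I have about 20 threads still open, but I'm not a database expert, and all of those threads come with conditions that make the queries very hard to understand, so I opened a new thread, with many apologies. @tofutim: Tim, we get this data from a third party, so we don't have much choice. That's why we are cleaning up the data now. :-) - Khuram
11 Answers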

177

Imagine your table is named test and contains the following data:

  select id, email
    from test;

ID                     EMAIL                
---------------------- -------------------- 
1                      aaa                  
2                      bbb                  
3                      ccc                  
4                      bbb                  
5                      ddd                  
6                      eee                  
7                      aaa                  
8                      aaa                  
9                      eee 

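So we need to find all the duplicated emails and delete them, keeping only the row with the latest id.
In this example, aaa, bbb and eee are duplicated, so we want to delete the rows with IDs 1, 7, 2 and 6.

To do that, we first need to find all the duplicated emails: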

      select email 
        from test
       group by email
      having count(*) > 1;

EMAIL                
-------------------- 
aaa                  
bbb                  
eee  

Then, from this dataset, we need to find the latest id for each duplicated email:

  select max(id) as lastId, email
    from test
   where email in (
              select email 
                from test
               group by email
              having count(*) > 1
       )
   group by email;

LASTID                 EMAIL                
---------------------- -------------------- 
8                      aaa                  
4                      bbb                  
9                      eee                                 

Finally, we can delete every row whose id is smaller than the LASTID of its email. So the solution is:

delete test
  from test
 inner join (
  select max(id) as lastId, email
    from test
   where email in (
              select email 
                from test
               group by email
              having count(*) > 1
       )
   group by email
) duplic on duplic.email = test.email
 where test.id < duplic.lastId;

I don't have MySQL installed on this machine right now, but it should work.

Update

The delete above works, but I found a more optimized version:

 delete test
   from test
  inner join (
     select max(id) as lastId, email
       from test
      group by email
     having count(*) > 1) duplic on duplic.email = test.email
  where test.id < duplic.lastId;

You can see that it deletes the oldest duplicates, i.e. 1, 7, 2 and 6:

select * from test;
+----+-------+
| id | email |
+----+-------+
|  3 | ccc   |
|  4 | bbb   |
|  5 | ddd   |
|  8 | aaa   |
|  9 | eee   |
+----+-------+
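
Before running a delete like this on real data, it can help to preview which ids the join would match. The following is a minimal sketch, not part of the original answer: it rebuilds the sample table in an in-memory SQLite database through Python's sqlite3 module (SQLite stands in for MySQL here, which is an assumption) and runs the same join as a SELECT. The names make_sample_db, PREVIEW_SQL and doomed_ids are hypothetical.

```python
import sqlite3


def make_sample_db():
    """Hypothetical helper: recreate the answer's sample table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
    rows = [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
            (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")]
    conn.executemany("INSERT INTO test (id, email) VALUES (?, ?)", rows)
    return conn


# Same join as the optimized delete, written as a SELECT so nothing is removed.
PREVIEW_SQL = """
SELECT test.id
  FROM test
 INNER JOIN (
     SELECT MAX(id) AS lastId, email
       FROM test
      GROUP BY email
     HAVING COUNT(*) > 1) duplic ON duplic.email = test.email
 WHERE test.id < duplic.lastId
 ORDER BY test.id
"""

conn = make_sample_db()
doomed_ids = [row[0] for row in conn.execute(PREVIEW_SQL)]
print(doomed_ids)  # ids that the delete would remove
```

If the printed ids are the ones you expect to lose, the SELECT can be turned back into the DELETE form shown in the answer.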

Another version is the delete suggested by Rene Limon:

delete from test
 where id not in (
    select max(id)
      from test
     group by email)
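
MySQL can refuse a delete whose subquery reads from the table being deleted, so a common workaround is to wrap the subquery in an aliased derived table. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for MySQL, the alias s is arbitrary, and the variable names are hypothetical.

```python
import sqlite3

# Rebuild the sample table from the answer (SQLite stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO test (id, email) VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
     (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")],
)

# Keep only the highest id per email. The inner SELECT is wrapped in a
# derived table with an alias so the delete does not read "test" directly.
DELETE_SQL = """
DELETE FROM test
 WHERE id NOT IN (
     SELECT * FROM (
         SELECT MAX(id) FROM test GROUP BY email
     ) AS s)
"""

deleted = conn.execute(DELETE_SQL).rowcount
conn.commit()
remaining = conn.execute("SELECT id, email FROM test ORDER BY id").fetchall()
print(deleted, remaining)
```

Because MAX(id) of an email that appears only once is that row's own id, unique rows survive without any HAVING clause.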

Your optimized query is exactly what I was looking for, though it is very slow, if there is no room for further improvement. - Fr0zenFyr
@Fr0zenFyr Most likely you need to create indexes on the id and email columns. If you already have them, make sure the statistics are up to date by running ANALYZE TABLE. You can run EXPLAIN SELECT to check what is going wrong, see http://dev.mysql.com/doc/refman/5.0/en/using-explain.html. - Jose Rui Santos
25
You could write it like this: DELETE FROM test WHERE id NOT IN (SELECT MAX(id) FROM test GROUP BY email) - Rene Limon
5
I get the error: Table 'test' is specified twice, both as a target for 'DELETE' and as a separate source for data. - Hamman Samuel
13
Please try a nested subquery so that MySQL materializes it and no longer treats it as "the same table", for example delete from test where id not in ( SELECT * FROM (select max(id) from test group by email) AS S) (the part I added is in uppercase). - Jose Rui Santos

63

Try this approach

DELETE t1 FROM test t1, test t2 
WHERE t1.id > t2.id AND t1.email = t2.email

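Create a table with 2 columns, id (primary key) and email (containing duplicate values), then run this query and you will get the result. It is a self-join on the same table that removes the duplicate records while keeping one copy. - Pulkit Malhotra
5
Not sure why this is buried so far down the page. It's simple and effective. - slightlyfaulty
2
Does this really keep the latest one? The latest row has the highest id, and this query looks like it deletes any id that is greater than another id. See @TanvirChowdhury's answer at https://dev59.com/bW025IYBdhLWcg3wRTxK#63434018. - Daniel F
1
Just swap the comparison to t2.id > t1.id to keep the greater one (t2). - Hoang Tran
2
Make sure your columns are indexed (e.g. id and email). Otherwise, if you have thousands or millions of records, it will take minutes (or even hours) to finish. - Luis Rodriguez
Does this keep the smallest id? Would where t1.id < t2.id keep the last entry? Thanks for sharing the tip :) - Antony Gibbs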
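
The direction of the id comparison decides which copy survives, and that is easy to check on the sample data. The sketch below is an illustration only, with several assumptions: it runs on an in-memory SQLite database via Python's sqlite3 module, and because SQLite has no multi-table DELETE it swaps the self-join for a correlated EXISTS subquery that expresses the same comparison (MySQL itself may reject this EXISTS form on the target table). The function name dedupe and its keep parameter are hypothetical.

```python
import sqlite3

SAMPLE_ROWS = [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
               (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")]


def dedupe(keep):
    """Return the ids left after deduplicating; keep is 'newest' or 'oldest'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO test (id, email) VALUES (?, ?)", SAMPLE_ROWS)
    # keep newest: delete a row when another row with the same email has a
    # HIGHER id (the self-join condition t1.id < t2.id).
    # keep oldest: delete a row when another row has a LOWER id
    # (the self-join condition t1.id > t2.id).
    op = ">" if keep == "newest" else "<"
    conn.execute(
        "DELETE FROM test WHERE EXISTS ("
        "  SELECT 1 FROM test AS t2"
        f"  WHERE t2.email = test.email AND t2.id {op} test.id)"
    )
    ids = [row[0] for row in conn.execute("SELECT id FROM test ORDER BY id")]
    conn.close()
    return ids


newest = dedupe("newest")
oldest = dedupe("oldest")
print(newest, oldest)
```

On this data the first call leaves the highest id per email and the second leaves the lowest, which matches what the comments above say about swapping the comparison.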

19

The correct way is

DELETE FROM `tablename`
  WHERE `id` NOT IN (
    SELECT * FROM (
      SELECT MAX(`id`) FROM `tablename`
        GROUP BY `name`
    ) x
  )

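What is the purpose of the "x" character? - Codex73
4
Running this query gives error 1248 (42000): Every derived table must have its own alias. Adding an alias, here called DTAB, fixes it: DELETE FROM tablename WHERE id NOT IN (SELECT * FROM (SELECT MAX(id) FROM tablename GROUP BY name) as DTAB). - Mr Ed
1
To fix error 1248 I had to add AS x after the first closing parenthesis following the GROUP BY variable; line 6 of the answer. - Alex P. Miller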

6
If you want to keep the row with the lowest id value:
DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id > n2.id AND n1.email = n2.email

If you want to keep the row with the highest id value:

DELETE n1 FROM `yourTableName` n1, `yourTableName` n2 WHERE n1.id < n2.id AND n1.email = n2.email

Or this query may also help

DELETE FROM `yourTableName` 
  WHERE id NOT IN (
    SELECT * FROM (
      SELECT MAX(id) FROM yourTableName 
        GROUP BY name
    ) x
  )

4
DELETE 
FROM
  `tbl_job_title` 
WHERE id NOT IN 
  (SELECT 
    * 
  FROM
    (SELECT 
      MAX(id) 
    FROM
      `tbl_job_title` 
    GROUP BY NAME) tbl)

Revised, working version!!! Thanks @Gaurav


2

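Personally I had trouble with the two top-voted answers. This is not the cleanest solution, but you can use a temporary table to avoid all the problems MySQL has with deleting through a join on the same table.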

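-- deleteRows holds the ids to KEEP; MIN(id) keeps the oldest row per email
-- (use MAX(id) instead to keep the newest)
CREATE TEMPORARY TABLE deleteRows
SELECT MIN(id) as id FROM myTable GROUP BY myTable.email;

DELETE FROM myTable
WHERE id NOT IN (SELECT id FROM deleteRows);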
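
For the question as asked (keep the newest row), the same temporary-table idea can be tried with MAX(id). This is a minimal sketch under assumptions: it runs on an in-memory SQLite database through Python's sqlite3 module rather than on MySQL, uses the sample table test from the top answer, and the temporary table name keep_rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO test (id, email) VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
     (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")],
)

# Step 1: materialize the ids to keep (newest row per email) in a temp table,
# so the delete below never reads from the table it is modifying.
conn.execute(
    "CREATE TEMPORARY TABLE keep_rows AS "
    "SELECT MAX(id) AS id FROM test GROUP BY email"
)

# Step 2: delete everything that is not in the keep list, then clean up.
conn.execute("DELETE FROM test WHERE id NOT IN (SELECT id FROM keep_rows)")
conn.execute("DROP TABLE keep_rows")
conn.commit()

kept_ids = [row[0] for row in conn.execute("SELECT id FROM test ORDER BY id")]
print(kept_ids)
```

Splitting the work into two statements also makes it possible to inspect the keep list before anything is deleted.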

1

I wanted to delete duplicate records based on multiple columns, and this approach worked for me.

Step 1 - get the max id of each group of duplicated records

select *  FROM ( SELECT MAX(id) FROM table_name 
group by travel_intimation_id,approved_by,approval_type,approval_status having 
count(*) > 1) a

Step 2 - get the ids of the records that occur only once

select *  FROM ( SELECT id FROM table_name 
group by travel_intimation_id,approved_by,approval_type,approval_status having 
count(*) = 1) b

Step 3 - exclude the results of both queries from the delete

DELETE FROM `table_name` 
WHERE 
id NOT IN (paste step 1 query) -- to exclude the row kept from each duplicate group
and 
id NOT IN (paste step 2 query) -- to exclude single records

Final query:

DELETE FROM `table_name` 

WHERE id NOT IN (

select *  FROM ( SELECT MAX(id) FROM table_name 
group by travel_intimation_id,approved_by,approval_type,approval_status having 
count(*) > 1) a 
)
and id not in (

select *  FROM ( SELECT id FROM table_name 
group by travel_intimation_id,approved_by,approval_type,approval_status having 
count(*) = 1) b
);

With this query, only the duplicate records are deleted.
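
Since MAX(id) of a group with a single row is that row's own id, the two NOT IN conditions can probably be folded into one query grouped by all the columns, with no HAVING clause. The sketch below illustrates that idea under assumptions: it runs on an in-memory SQLite database via Python's sqlite3 module instead of MySQL, and the table approvals, its two grouping columns and the sample rows are hypothetical, reduced from the four columns in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE approvals ("
    " id INTEGER PRIMARY KEY,"
    " travel_intimation_id INTEGER,"
    " approved_by TEXT)"
)
conn.executemany(
    "INSERT INTO approvals VALUES (?, ?, ?)",
    [(1, 10, "amy"), (2, 10, "amy"),   # duplicate pair, keep id 2
     (3, 10, "bob"), (4, 11, "amy"),   # id 4 is unique
     (5, 10, "bob"),                   # duplicate of id 3, keep id 5
     (6, 12, "cat")],                  # unique
)

# One pass: keep the highest id of every (travel_intimation_id, approved_by)
# combination. Unique rows are their own MAX(id), so they are kept as well.
conn.execute(
    "DELETE FROM approvals WHERE id NOT IN ("
    "  SELECT * FROM ("
    "    SELECT MAX(id) FROM approvals"
    "    GROUP BY travel_intimation_id, approved_by"
    "  ) AS keepers)"
)
conn.commit()

survivors = [r[0] for r in conn.execute("SELECT id FROM approvals ORDER BY id")]
print(survivors)
```

The derived-table alias (keepers here) is the same trick used elsewhere in this thread to keep MySQL from complaining about reading the delete target.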


1
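I must say the optimized version is a beautiful, elegant piece of code, and it works perfectly even when the comparison is done on a DATETIME column. Here is what I used in my script, where I look for the latest contract end date for each EmployeeID: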
DELETE CurrentContractData
  FROM CurrentContractData
  INNER JOIN (
    SELECT
      EmployeeID,
      PeriodofPerformanceStartDate,
      max(PeriodofPerformanceEndDate) as lastDate,
      ContractID
    FROM CurrentContractData
    GROUP BY EmployeeID
    HAVING COUNT(*) > 1) Duplicate on Duplicate.EmployeeID = CurrentContractData.EmployeeID
    WHERE CurrentContractData.PeriodofPerformanceEndDate < Duplicate.lastDate;

Many thanks!

I had a similar problem and tried to rebuild your query for my table. I noticed that DELETE mytable FROM mytable ..... did not work, so I changed the first two lines to DELETE from mytable WHERE id IN (SELECT id FROM mytable .... - Radon8472

0
Please try the following solution (based on the comments on "@Jose Rui Santos"'s answer):
-- Set safe mode to false since;
-- You are using safe update mode and tried to update a table without a WHERE that uses a KEY column
SET SQL_SAFE_UPDATES = 0;

-- Delete the duplicate rows based on the field_with_duplicate_values 
-- Keep the unique rows with the highest id
DELETE FROM table_to_deduplicate
WHERE id NOT IN (
    SELECT * FROM (
        -- Select the highest id grouped by the field_with_duplicate_values
        SELECT MAX(id)
        FROM table_to_deduplicate
        GROUP BY field_with_duplicate_values
    )
    -- Subquery and alias needed since;
    -- You can't specify target table 'table_to_deduplicate' for update in FROM clause
    AS table_sub
);

-- Set safe mode to true
SET SQL_SAFE_UPDATES = 1;

0
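-- No HAVING COUNT(*) > 1 in the subquery: with it, rows whose iid occurs
-- only once would not be in the keep list and would be deleted too
delete  from iamsmsaccountmetadata where 
 id not in (select del.id from ( select iid,max(id) as id
 from iam.iamsmsaccountmetadata
 group by iid) as del )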

This is the exact approach, verified and tested.
