Delete only consecutive duplicate rows

6
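I have been collecting data from an API to build up a history. Initially I saved all values every five minutes. Later I changed my program to save only data that had changed.
Now I want to clean up the old data and delete every row whose count did not change from the previous record for the same account and id.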
account id      count   time
42      12147   492     2015-09-20 11:31:14.0
42      12147   492     2015-09-20 11:36:19.0 // delete
13      12147   246     2015-09-20 11:31:14.0
2       12253   183     2015-09-20 11:36:19.0
2       19684   805     2015-09-20 12:00:41.0 // note in next comment
2       19684   810     2015-09-20 12:05:41.0
2       19684   805     2015-09-20 12:10:41.0 // we had this combination, but don't delete this record because the previous value was different
2       19684   805     2015-09-20 12:15:41.0 // delete
2       19684   805     2015-09-20 12:20:41.0 // delete
2       19684   806     2015-09-20 12:25:41.0

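I tried to solve this with GROUP BY account, id, count. However, that approach also deletes non-consecutive duplicates: if a value shows up again some time later, it falls into the same group.

I also considered writing a small script that walks through all the data and deletes the current row whenever its account, id, and count match the previous record, but I am curious whether this can be done with a single SQL statement.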
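The script idea mentioned above could look roughly like the following. This is only a minimal sketch in Python over the sample rows from the question; the function name `consecutive_duplicates` and the tuple layout `(account, id, count, time)` are assumptions made for illustration, and a real script would read from and delete in the database instead.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (account, id, count, time).
rows = [
    (42, 12147, 492, "2015-09-20 11:31:14"),
    (42, 12147, 492, "2015-09-20 11:36:19"),
    (13, 12147, 246, "2015-09-20 11:31:14"),
    (2, 12253, 183, "2015-09-20 11:36:19"),
    (2, 19684, 805, "2015-09-20 12:00:41"),
    (2, 19684, 810, "2015-09-20 12:05:41"),
    (2, 19684, 805, "2015-09-20 12:10:41"),
    (2, 19684, 805, "2015-09-20 12:15:41"),
    (2, 19684, 805, "2015-09-20 12:20:41"),
    (2, 19684, 806, "2015-09-20 12:25:41"),
]


def consecutive_duplicates(rows):
    """Return rows whose count equals the previous row's count
    within the same (account, id) series."""
    to_delete = []
    # Sort by account, id, then time so each series is contiguous
    # and in chronological order.
    ordered = sorted(rows, key=itemgetter(0, 1, 3))
    for _, series in groupby(ordered, key=itemgetter(0, 1)):
        previous_count = None
        for row in series:
            if row[2] == previous_count:
                to_delete.append(row)
            previous_count = row[2]
    return to_delete


doomed = consecutive_duplicates(rows)
for row in doomed:
    print(row)
```

For the sample data this flags the second 492 row and the 805 rows at 12:15:41 and 12:20:41, while keeping the 805 at 12:10:41 that follows the 810.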


The third row is a different "account", so it is in ascending order, right? - das Keks
2 Answers

4
You can use the following query:
DELETE history 
FROM history 
INNER JOIN (SELECT MIN(time) AS minTime, account, id, count
            FROM history
            GROUP BY account, id, count) AS h
ON history.account = h.account AND history.id = h.id AND history.count = h.count
WHERE history.time > h.minTime

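Demo here

Edit:

After the edit to the question, I think the sample data still contains some errors (the time field should be in ascending order).

Assuming the table has a primary key, you can use the following query: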

SELECT pk
FROM history AS h1
WHERE account = (SELECT account 
                 FROM history AS h2
                 WHERE h1.account = h2.account AND
                       h1.id = h2.id AND                       
                       h2.time < h1.time
                 ORDER BY time DESC 
                 LIMIT 1)
      AND
      id = (SELECT id 
            FROM history AS h2
            WHERE h1.account = h2.account AND
                  h1.id = h2.id AND                  
                  h2.time < h1.time
            ORDER BY time DESC 
            LIMIT 1)
      AND
      count = (SELECT count
               FROM history AS h2
               WHERE h1.account = h2.account AND
                     h1.id = h2.id AND                     
                     h2.time < h1.time
               ORDER BY time DESC 
               LIMIT 1)

This query identifies the records that need to be deleted (see this demo). You can then delete the redundant rows easily with the IN operator:
DELETE FROM history 
WHERE pk IN (
SELECT x.pk
FROM (             
  SELECT pk
  FROM history AS h1
  WHERE 
     account = (SELECT account 
                FROM history AS h2
                WHERE h1.account = h2.account AND
                      h1.id = h2.id AND                       
                      h2.time < h1.time
                      ORDER BY time DESC 
                      LIMIT 1)

     AND

     id = (SELECT id 
           FROM history AS h2
           WHERE h1.account = h2.account AND
                 h1.id = h2.id AND                  
                 h2.time < h1.time
           ORDER BY time DESC 
           LIMIT 1)

     AND

     count = (SELECT count
              FROM history AS h2
              WHERE h1.account = h2.account AND
                    h1.id = h2.id AND                     
                    h2.time < h1.time
              ORDER BY time DESC 
              LIMIT 1)) AS x)

Demo here

Edit 2:

Using variables to locate the pk values to delete may result in a faster query:

SELECT pk
FROM (
  SELECT pk, account, id, count, time,
         @rn := IF (account = @acc AND id = @id AND count = @count,
                    @rn + 1, 1) AS rn,
         @acc := account,
         @id := id,
         @count := count
  FROM history
  CROSS JOIN (SELECT @rn := 0, @acc := 0, @id := 0, @count := 0) AS vars
  ORDER BY account, id, time, count ) AS t
WHERE t.rn > 1

Demo here
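On engines that support window functions, the same "compare with the previous row" idea can be written with LAG() instead of user variables. The sketch below is not part of the original answer: it assumes MySQL 8.0 or later (or any engine with window functions) and is run here against an in-memory SQLite database through Python's sqlite3 module (SQLite 3.25 or later) purely so that it is self-contained. The table and column names follow the answer, including its assumed primary key pk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history ("
    "pk INTEGER PRIMARY KEY, account INT, id INT, count INT, time TEXT)"
)
conn.executemany(
    "INSERT INTO history (account, id, count, time) VALUES (?, ?, ?, ?)",
    [
        (42, 12147, 492, "2015-09-20 11:31:14"),
        (42, 12147, 492, "2015-09-20 11:36:19"),
        (13, 12147, 246, "2015-09-20 11:31:14"),
        (2, 12253, 183, "2015-09-20 11:36:19"),
        (2, 19684, 805, "2015-09-20 12:00:41"),
        (2, 19684, 810, "2015-09-20 12:05:41"),
        (2, 19684, 805, "2015-09-20 12:10:41"),
        (2, 19684, 805, "2015-09-20 12:15:41"),
        (2, 19684, 805, "2015-09-20 12:20:41"),
        (2, 19684, 806, "2015-09-20 12:25:41"),
    ],
)

# LAG() returns the count of the previous row in the same (account, id)
# series, ordered by time. A row is redundant when its count equals that
# previous count. The first row of each series has prev_count NULL and is
# therefore never selected.
conn.execute(
    """
    DELETE FROM history
    WHERE pk IN (
        SELECT pk
        FROM (
            SELECT pk,
                   count,
                   LAG(count) OVER (
                       PARTITION BY account, id
                       ORDER BY time, pk
                   ) AS prev_count
            FROM history
        ) AS t
        WHERE t.count = t.prev_count
    )
    """
)

remaining = conn.execute(
    "SELECT account, id, count, time FROM history ORDER BY pk"
).fetchall()
for row in remaining:
    print(row)
```

The extra derived table t mirrors the wrapping used in the DELETE above, which MySQL needs when the subquery reads from the table being deleted from.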


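The demo is great. But there is a problem with account 2, id 19684: it goes up from 805 to 810 and then back to 805. Those are all valid changes. After the 810, only the two subsequent 805 rows should be deleted, not the first 805, because there is also an 806 after the 810. - das Keks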
1
@dasKeks I have edited my answer; it should work now. - Giorgos Betsos
1
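I added a primary key, and in general this seems to work. But on a table with 700,000 rows, a statement with three subqueries per row is probably not very fast. I wish I could see the progress of a single SQL statement. - das Keks
@dasKeks Please check the edit I made. Does this query run faster than the previous one? - Giorgos Betsos
Determining all the primary keys takes about 70 seconds, which is quite fast. The deletion takes more time, but I think this is as fast as it gets. Thanks :) - das Keks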

0

You can delete everything except the first occurrence with the following (untested) code:

delete from history h1 
where exists (select 1
              from history h2
              where
                h1.account = h2.account and
                h1.id = h2.id and
                h1.count = h2.count and
                h1.time < h2.time
             )

1
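I think it should be h1.time *>* h2.time, because the OP wants to keep the older records. - Juan Carlos Oropeza
Unless I'm missing something, this seems to just delete _duplicates_, regardless of whether they are _consecutive_. The consecutive part is what makes this question tricky, and what sets it apart from a dozen earlier Stack Overflow questions. - Jeremy Caney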
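To make the difference concrete, here is a small Python sketch (not from the thread) that applies both rules to the counts of account 2, id 19684 from the question, in time order. The variable names are made up for illustration.

```python
# Counts for account 2, id 19684 from the question, in time order.
series = [805, 810, 805, 805, 805, 806]

# "Any earlier duplicate" rule: flag a row if the same count appeared
# anywhere before it in the series.
seen = set()
any_duplicate = []
for index, count in enumerate(series):
    if count in seen:
        any_duplicate.append(index)
    seen.add(count)

# "Consecutive duplicate" rule: flag a row only if the row immediately
# before it has the same count.
consecutive = [
    index
    for index in range(1, len(series))
    if series[index] == series[index - 1]
]

print(any_duplicate)  # [2, 3, 4]
print(consecutive)    # [3, 4]
```

Index 2 is the 805 that follows the 810. The first rule would delete it, while the second keeps it, which is what the question asks for.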

Content provided by Stack Overflow.