Removing consecutive duplicate data from a MySQL table

Question

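I have a MySQL table that stores time-series data: essentially temperature and humidity samples from several sensors, taken at (relatively) fixed intervals.

A value for each sensor used to be stored in the table periodically (along with the sensor's ID), whether or not the recorded temperature and humidity had changed. This produced a fairly large table, so I have updated the application to store a value for a sensor only when the recorded value changes. When a value changes, it now stores two rows, the sensor values before and after the change (which lets the charting code stay unchanged).

Now I would like to tidy up the old data by removing any consecutive duplicate readings for a given sensor, keeping only the rows that represent the first time a new sensor value was recorded or the last time that value was recorded. In effect, this just removes redundant data.

I have tried to build this query, but because the primary keys for a particular individual sensor are not consecutive, I am struggling to identify the duplicate rows that can be deleted.

This is better illustrated with a data excerpt. I have annotated the rows I want to keep with a note explaining why that particular row should be kept.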

+-----+----------+---------------------+-------------+----------+
| id  | sensorid | datetime            | temperature | humidity |
+-----+----------+---------------------+-------------+----------+
| 818 | E9       | 2012-10-23 20:59:03 |       20.00 |       72 | First temperature value for E9
| 819 | C3       | 2012-10-23 20:59:19 |       19.50 |       69 | First temperature value for C3
| 820 | E9       | 2012-10-23 20:59:47 |       20.00 |       72 |
| 821 | C3       | 2012-10-23 21:00:00 |       19.50 |       69 |
| 822 | E9       | 2012-10-23 21:00:29 |       20.00 |       72 |
| 823 | C3       | 2012-10-23 21:00:41 |       19.50 |       69 |
| 824 | E9       | 2012-10-23 21:01:12 |       20.00 |       72 |
| 825 | C3       | 2012-10-23 21:01:22 |       19.50 |       69 |
| 826 | E9       | 2012-10-23 21:01:55 |       20.00 |       72 |
| 827 | C3       | 2012-10-23 21:02:03 |       19.50 |       69 |
| 828 | E9       | 2012-10-23 21:02:38 |       20.00 |       72 |
| 829 | C3       | 2012-10-23 21:02:44 |       19.50 |       69 |
| 830 | E9       | 2012-10-23 21:03:21 |       20.00 |       72 |
| 831 | C3       | 2012-10-23 21:03:25 |       19.50 |       69 |
| 832 | E9       | 2012-10-23 21:04:04 |       20.00 |       72 |
| 833 | C3       | 2012-10-23 21:04:06 |       19.50 |       69 |
| 834 | EC       | 2012-10-23 21:04:32 |       13.90 |       91 | First temperature value for EC
| 835 | EC       | 2012-10-23 21:04:32 |       13.90 |       91 |
| 836 | C3       | 2012-10-23 21:04:47 |       19.50 |       69 |
| 837 | E9       | 2012-10-23 21:04:47 |       20.00 |       72 |
| 838 | EC       | 2012-10-23 21:05:11 |       13.90 |       91 |
| 839 | C3       | 2012-10-23 21:05:28 |       19.50 |       69 |
| 840 | E9       | 2012-10-23 21:05:31 |       20.00 |       72 |
| 841 | EC       | 2012-10-23 21:05:50 |       13.90 |       91 |
| 842 | C3       | 2012-10-23 21:06:09 |       19.50 |       69 |
| 843 | E9       | 2012-10-23 21:06:13 |       20.00 |       72 | Last temperature value of 20 for E9
| 844 | EC       | 2012-10-23 21:06:29 |       13.90 |       91 |
| 845 | C3       | 2012-10-23 21:06:50 |       19.50 |       69 |
| 846 | E9       | 2012-10-

- Ian Gregory

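Do you actually need to delete the "superfluous" records from the underlying table, or do you merely want to obtain a filtered/reduced result set? - eggyal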

I would like to delete them, but if I can obtain a reduced result set, I can use it to populate a replacement table. - Ian Gregory

2 Answers

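Basically, you need to join each record to its previous and next record for the same sensor, so that you can compare their temperatures: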

SELECT t.id,
       t.sensorid,
       t.temperature,
       t.comment,
       prev.id prev_id,
       prev.temperature prev_temp,
       next.id next_id,
       next.temperature next_temp
FROM table1 t
LEFT JOIN table1 prev
  ON prev.id = ( SELECT max(id)
                 FROM table1 t1
                 WHERE t1.sensorid = t.sensorid
                   AND t1.id < t.id
                )
LEFT JOIN table1 next
  ON next.id =  ( SELECT min(id)
                  FROM table1 t1
                  WHERE t1.sensorid = t.sensorid
                    AND t1.id > t.id
                )
ORDER BY t.sensorid, t.id
;

Demo --> http://www.sqlfiddle.com/#!2/297ab/4

With this query you can find the records that need to be deleted by checking this condition:

current-row-temperature = previous-temperature 
   AND
current-row-temperature = next-temperature

The query looks like this:

SELECT t.id,
       t.sensorid,
       t.temperature,
       t.comment,
       prev.id prev_id,
       prev.temperature prev_temp,
       next.id next_id,
       next.temperature next_temp
FROM table1 t
LEFT JOIN table1 prev
  ON prev.id = ( SELECT max(id)
                 FROM table1 t1
                 WHERE t1.sensorid = t.sensorid
                   AND t1.id < t.id
                )
LEFT JOIN table1 next
  ON next.id =  ( SELECT min(id)
                  FROM table1 t1
                  WHERE t1.sensorid = t.sensorid
                    AND t1.id > t.id
                )
WHERE t.temperature = prev.temperature
  AND t.temperature = next.temperature
ORDER BY t.sensorid, t.id
;
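
For readers who want to try the condition without a MySQL server, here is a minimal sketch that runs the same self-join through Python's built-in sqlite3 module. The sample rows are hypothetical (they are not the question's data), the aliases prev_row and next_row stand in for prev and next, and SQLite is only an assumed stand-in for MySQL. Like the query above, it compares temperature only and ignores humidity.

```python
import sqlite3

# Hypothetical sample rows (id, sensorid, temperature); not the question's data.
ROWS = [
    (1, "E9", 20.0),
    (2, "C3", 19.5),
    (3, "E9", 20.0),
    (4, "C3", 19.5),
    (5, "E9", 20.0),
    (6, "C3", 19.5),
    (7, "E9", 20.5),
    (8, "C3", 19.5),
]

# Same join as in the answer: each row is paired with the previous and next
# row of the same sensor, and selected when all three temperatures match.
REDUNDANT_SQL = """
SELECT t.id
FROM table1 t
LEFT JOIN table1 prev_row
  ON prev_row.id = (SELECT max(id) FROM table1 t1
                    WHERE t1.sensorid = t.sensorid AND t1.id < t.id)
LEFT JOIN table1 next_row
  ON next_row.id = (SELECT min(id) FROM table1 t1
                    WHERE t1.sensorid = t.sensorid AND t1.id > t.id)
WHERE t.temperature = prev_row.temperature
  AND t.temperature = next_row.temperature
ORDER BY t.id
"""


def redundant_ids(rows):
    """Return the ids the condition marks as redundant, in id order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (id INTEGER PRIMARY KEY, sensorid TEXT, temperature REAL)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
    ids = [row[0] for row in conn.execute(REDUNDANT_SQL)]
    conn.close()
    return ids


print(redundant_ids(ROWS))  # prints [3, 4, 6]
```

Rows 1, 2, 7 and 8 survive because they have no previous or next row for their sensor, and row 5 survives because the next E9 reading differs.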

The query that selects the redundant rows can be used as a subquery in a multi-table DELETE:

DELETE t1
FROM table1 t1,
(
   the above query
) x1
WHERE t1.id = x1.id

You can also negate the condition to retrieve only the records you want to keep:

SELECT t.id,
       t.sensorid,
       t.temperature,
       t.comment,
       prev.id prev_id,
       prev.temperature prev_temp,
       next.id next_id,
       next.temperature next_temp
FROM table1 t
LEFT JOIN table1 prev
  ON prev.id = ( SELECT max(id)
                 FROM table1 t1
                 WHERE t1.sensorid = t.sensorid
                   AND t1.id < t.id
                )
LEFT JOIN table1 next
  ON next.id =  ( SELECT min(id)
                  FROM table1 t1
                  WHERE t1.sensorid = t.sensorid
                    AND t1.id > t.id
                )
WHERE t.temperature <> prev.temperature
   OR t.temperature <> next.temperature
   OR prev.temperature IS NULL
   OR next.temperature IS NULL
ORDER BY t.sensorid, t.id
;
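
The two IS NULL branches are there because a comparison with NULL evaluates to NULL rather than true, so the first and last reading of each sensor (which have no previous or next row) would otherwise be dropped from the result. A small sketch of that behaviour, using Python's sqlite3 as an assumed stand-in for MySQL; the helper name evaluate is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def evaluate(expression):
    """Return the value of a constant SQL expression (None means NULL)."""
    return conn.execute("SELECT " + expression).fetchone()[0]


# A row with no predecessor has prev.temperature = NULL after the LEFT JOIN.
print(evaluate("20.0 <> NULL"))                  # prints None: NULL, so the row is filtered out
print(evaluate("20.0 <> NULL OR NULL IS NULL"))  # prints 1: true once the IS NULL branch is added
```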

The same condition can be used to copy the records you want to keep into a new table:

CREATE TABLE new_table AS
SELECT t.*
FROM table1 t
LEFT JOIN table1 prev
  ON prev.id = ( SELECT max(id)
                 FROM table1 t1
                 WHERE t1.sensorid = t.sensorid
                   AND t1.id < t.id
                )
LEFT JOIN table1 next
  ON next.id =  ( SELECT min(id)
                  FROM table1 t1
                  WHERE t1.sensorid = t.sensorid
                    AND t1.id > t.id
                )
WHERE t.temperature <> prev.temperature
   OR t.temperature <> next.temperature
   OR prev.temperature IS NULL
   OR next.temperature IS NULL
ORDER BY t.sensorid, t.id
;

- krokodilko

One of the great things about SO is getting two excellent answers within minutes of asking a question. Thank you. - Ian Gregory

- eggyal · Accepted Answer

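By using user variables to track the "last" values of the sensorid, temperature and humidity columns as the whole table is read in order of sensorid and datetime, one can determine which "group" each record belongs to and then aggregate on that basis: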

SELECT sensorid, temperature, humidity,
       MIN(datetime) dt_min, MAX(datetime) dt_max
FROM (
  SELECT   datetime,
           @group := @group + IF(
             @last_sensor   <=> sensorid
         AND @last_temp     <=> temperature
         AND @last_humidity <=> humidity
           , 0, 1) gp,
             @last_sensor   :=  sensorid    sensorid,
             @last_temp     :=  temperature temperature,
             @last_humidity :=  humidity    humidity
  FROM     my_table, (SELECT
             @group         :=  0,
             @last_sensor   :=  NULL,
             @last_temp     :=  NULL,
             @last_humidity :=  NULL
           ) init
  ORDER BY sensorid, datetime
) t GROUP BY t.gp

See it on sqlfiddle.
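
The grouping that the user variables perform can be easier to follow outside SQL. The following is a minimal Python sketch of the same idea, not the answer's implementation: the function name collapse_runs and the sample readings are hypothetical, and datetimes are assumed to be ISO-style strings so that string order equals time order.

```python
def collapse_runs(readings):
    """Collapse consecutive identical readings per sensor into runs.

    readings: iterable of (sensorid, datetime, temperature, humidity) tuples.
    Returns one dict per run with its first and last datetime.
    """
    runs = []
    last = None  # plays the role of @last_sensor, @last_temp, @last_humidity
    for sensorid, dt, temperature, humidity in sorted(
        readings, key=lambda r: (r[0], r[1])  # ORDER BY sensorid, datetime
    ):
        current = (sensorid, temperature, humidity)
        if current != last:
            # Values changed: start a new group (the SQL adds 1 to @group here).
            runs.append({"sensorid": sensorid, "temperature": temperature,
                         "humidity": humidity, "dt_min": dt, "dt_max": dt})
            last = current
        else:
            # Same values as the previous row: extend the current group.
            runs[-1]["dt_max"] = dt
    return runs


# Hypothetical readings; the E9 temperature changes to 20.5 and back to 20.0.
READINGS = [
    ("E9", "2012-10-23 20:59:03", 20.0, 72),
    ("C3", "2012-10-23 20:59:19", 19.5, 69),
    ("E9", "2012-10-23 20:59:47", 20.0, 72),
    ("C3", "2012-10-23 21:00:00", 19.5, 69),
    ("E9", "2012-10-23 21:00:29", 20.5, 72),
    ("E9", "2012-10-23 21:01:12", 20.0, 72),
]

for run in collapse_runs(READINGS):
    print(run["sensorid"], run["temperature"], run["dt_min"], run["dt_max"])
```

The second 20.0 run for E9 is reported separately from the first, which is the point of grouping by run rather than by value. One caveat when relying on the SQL version: the MySQL manual describes the evaluation order of expressions involving user variables as undefined, so it is worth verifying the output on the server version in use.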

Using the grouping query as a derived table, one can perform an anti-join to delete all other records from the original table:

DELETE my_table.*
FROM   my_table LEFT JOIN (
         <above query>
       ) x
   ON  my_table.sensorid    = x.sensorid
   AND my_table.temperature = x.temperature
   AND my_table.humidity    = x.humidity
   AND my_table.datetime IN (x.dt_min, x.dt_max)
WHERE x.sensorid IS NULL

See it on sqlfiddle.

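Note that if two (identical) readings were taken from the same sensor at the same datetime, it is unclear which record should be kept or deleted (especially since you indicate that "the primary keys for a particular individual sensor are not consecutive"). Because the anti-join matches on datetime rather than on id, the query above therefore also keeps the record with id = 835.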
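
To make the duplicate-timestamp case concrete, here is a rough Python sketch of the anti-join's effect, under stated assumptions: the function name ids_to_delete is hypothetical, the rows are the EC readings copied from the question's excerpt, and treating 844 as the last EC reading is only an artifact of this small sample.

```python
def ids_to_delete(rows):
    """Return ids of rows that are neither the first nor the last of a run.

    rows: (id, sensorid, datetime, temperature, humidity) tuples. A row is
    kept when its (sensorid, temperature, humidity, datetime) matches the
    dt_min or dt_max of a run, mirroring the join condition in the DELETE.
    """
    boundaries = set()
    last = None
    run_end = None
    for _id, sensorid, dt, temperature, humidity in sorted(
        rows, key=lambda r: (r[1], r[2])
    ):
        current = (sensorid, temperature, humidity)
        if current != last:
            if last is not None:
                boundaries.add(last + (run_end,))  # dt_max of the finished run
            boundaries.add(current + (dt,))        # dt_min of the new run
            last = current
        run_end = dt
    if last is not None:
        boundaries.add(last + (run_end,))          # dt_max of the final run
    return sorted(r[0] for r in rows if (r[1], r[3], r[4], r[2]) not in boundaries)


# EC rows from the question's excerpt; 834 and 835 share the same datetime.
EC_ROWS = [
    (834, "EC", "2012-10-23 21:04:32", 13.9, 91),
    (835, "EC", "2012-10-23 21:04:32", 13.9, 91),
    (838, "EC", "2012-10-23 21:05:11", 13.9, 91),
    (841, "EC", "2012-10-23 21:05:50", 13.9, 91),
    (844, "EC", "2012-10-23 21:06:29", 13.9, 91),
]

print(ids_to_delete(EC_ROWS))  # prints [838, 841]
```

Both 834 and 835 survive because they share the run's dt_min. If only one of them should remain, an extra tie-break (for example on the lowest id) would be needed; that rule is an assumption, since the question does not specify one.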