优化MySQL的范围查询和分组统计查询

4
我有一个包含每天温度的表格(巨大的表格),以及一个包含周期开始和结束日期的表格(小表格)。现在我想知道每个周期的平均温度,但是查询时间很长。是否可以改进?
注意:升级到版本5.6.19-1~exp1ubuntu2后,响应时间变短,并且可能是由于MySQL 5.6.8之前的版本存在错误(请参见Quassnoi的评论)。
重新构建包含随机数据的日和周期表:
create table days (
  day int not null auto_increment primary key,
  temperature float not null);

insert into days values(null,rand()),(null,rand()),
  (null,rand()),(null,rand()),(null,rand()),(null,rand()),
  (null,rand()),(null,rand()); # 8 rows

insert into days select null, d1.temperature
  from days d1, days d2, days d3, days d4,
  days d5, days d6, days d7; # 2M rows

create table periods(id int not null auto_increment primary key,
  first int not null,
  last int not null,
  index(first) using btree,
  index(last) using btree,
  index(first,last) using btree);

# add 10 periods of 1-11 days each
insert into periods(first,last)
  select floor(rand(day)*2000000), floor(rand(day)*2000000 + rand()*10)
  from days limit 10;

列出每个时间段的每天气温是没有问题的(返回结果在1毫秒内):

select id, temperature
  from periods join days on day >= first and day <= last;

现在,使用GROUP BY 进行查询速度非常缓慢(约为1750毫秒)。
# ALT1
select id, avg(temperature)
  from periods join days on day >= first and day <= last group by id;

将 <= 和 >= 替换为 BETWEEN 可以略微加快速度(约1600毫秒):

# ALT2
select id, avg(temperature)
  from periods join days on day between first and last group by id;

原来,单个时间段的结果可以立即返回(1毫秒):
select id, (select avg(temperature)
  from days where day >= first and day <= last) from periods
  where id=1;

然而,如果没有WHERE条件,查询需要长达4200毫秒,平均每个周期需要420毫秒!

# ALT3
select id,
  (select avg(temperature) from days where day >= first and day <= last)
  from periods;

查询为什么如此缓慢,甚至比获取单个时间段结果要慢(多达10倍),尽管“periods”表只有10行?是否有任何方法可以优化此查询?
编辑:更多信息:
mysql> select @@version;
+-------------------------+
| @@version               |
+-------------------------+
| 5.5.41-0ubuntu0.14.04.1 |
+-------------------------+

# ALT1
mysql> explain select id, avg(temperature) from periods join days on day >= first and day <= last group by id;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Using where; Using join buffer               |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+

# ALT1 without GROUP BY
mysql> explain select id, temperature from periods join days on day >= first and day <= last;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                          |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index                                    |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Range checked for each record (index map: 0x1) |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+------------------------------------------------+

# ALT2
mysql> explain select id, avg(temperature) from periods join days on day between first and last group by id;
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+
|  1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2097596 | Using where; Using join buffer               |
+----+-------------+---------+-------+--------------------+---------+---------+------+---------+----------------------------------------------+

# ALT3
mysql> explain select id, (select avg(temperature) from days where day >= first and day <= last) from periods;
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+
| id | select_type        | table   | type  | possible_keys | key     | key_len | ref  | rows    | Extra       |
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+
|  1 | PRIMARY            | periods | index | NULL          | first_2 | 8       | NULL |      10 | Using index |
|  2 | DEPENDENT SUBQUERY | days    | ALL   | PRIMARY       | NULL    | NULL    | NULL | 2097596 | Using where |
+----+--------------------+---------+-------+---------------+---------+---------+------+---------+-------------+

# ALT3 with where
mysql> explain select id, (select avg(temperature) from days where day >= first and day <= last) from periods where id = 1;
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type        | table   | type  | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+
|  1 | PRIMARY            | periods | const | PRIMARY       | PRIMARY | 4       | const |    1 |             |
|  2 | DEPENDENT SUBQUERY | days    | range | PRIMARY       | PRIMARY | 4       | NULL  |   10 | Using where |
+----+--------------------+---------+-------+---------------+---------+---------+-------+------+-------------+

编辑2:根据Lennart的建议,嵌套查询FROM的执行计划(查询执行时间为3毫秒)

mysql> explain select id,avg(temperature) from (select id,temperature from periods join days on day between first and last) as t group by id;
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+
| id | select_type | table      | type  | possible_keys      | key     | key_len | ref  | rows     | Extra                                          |
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL               | NULL    | NULL    | NULL |       50 | Using temporary; Using filesort                |
|  2 | DERIVED     | periods    | index | first,last,first_2 | first_2 | 8       | NULL |       10 | Using index                                    |
|  2 | DERIVED     | days       | range | PRIMARY,day        | PRIMARY | 4       | NULL |        5 | Range checked for each record (index map: 0x3) |
+----+-------------+------------+-------+--------------------+---------+---------+------+----------+------------------------------------------------+

1
请在您的查询中使用EXPLAIN并展示结果。 - O. Jones
非常好的问题,使用工作查询来重现问题。大家都像wau一样去做! - Quassnoi
周期的大小可以因创建方式而大不相同。我猜想,如果您查看正在处理的天数,您就会明白为什么 id = 1 比其他 id 更快。 - Gordon Linoff
@wau:你使用的是哪个版本的MySQL? - Quassnoi
1
@GordonLinoff:周期的大小为0到9天。这是MySQL中的一个错误:http://bugs.mysql.com/bug.php?id=41659 - Quassnoi
显示剩余2条评论
2个回答

2

这里有一个不太好看的技巧,因为:

select id, temperature 
from periods join days 
    on day between first and last;

为了让查询更快,我们可以尝试激励优化器先评估它。但单独使用子查询是不够的:

select id, avg(temperature) 
from (
    select id, temperature 
    from periods 
    join days 
        on day between first and last
) as t 
group by id;
[...]
10 rows in set (1.67 sec)

然而,在子查询中调用一个非确定性函数似乎能够解决问题:
select id, avg(temperature) 
from (
    select id, temperature, rand() 
    from periods 
    join days 
        on day between first and last
) as t 
group by id;
[...]
10 rows in set (0.00 sec)

除非是至关重要的,我建议不要使用这样的技巧。随着优化器变得更好(也许在下一个修复版本中),它可能会跳过对rand()的调用,突然间你的旧计划和性能就回来了。

如果您确实使用了这些技巧,请务必在代码中仔细记录它们,以便在不再需要时清理它们。

MariaDB [test]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.0.20-MariaDB |
+-----------------+
1 row in set (0.00 sec)

explain select id, avg(temperature) from periods join days on day between first and last group by id;
| id   | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                           |
|    1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort    |
|    1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Using where; Using join buffer (flat, BNL join) |

explain select id, avg(temperature) from (select id, temperature from periods join days on day between first and last) as t group by id;
| id   | select_type | table   | type  | possible_keys      | key     | key_len | ref  | rows    | extra                                           |
|    1 | SIMPLE      | periods | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index; Using temporary; Using filesort    |
|    1 | SIMPLE      | days    | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Using where; Using join buffer (flat, BNL join) |


explain select id, avg(temperature) from (select id, temperature, rand() from periods join days on day between first and last) as t group by id;
| id   | select_type | table      | type  | possible_keys      | key     | key_len | ref  | rows    | Extra                                          |
|    1 | PRIMARY     | <derived2> | ALL   | NULL               | NULL    | NULL    | NULL |       2 | Using temporary; Using filesort                |
|    2 | DERIVED     | periods    | index | first,last,first_2 | first_2 | 8       | NULL |      10 | Using index                                    |
|    2 | DERIVED     | days       | ALL   | PRIMARY            | NULL    | NULL    | NULL | 2094315 | Range checked for each record (index map: 0x1) |

谢谢,这对我很有效 - 实际上,即使没有RAND(),这个查询也很快。 - wau
对我来说,使用示例数据的第一次和没有 rand() 的计划所需的时间大致相同。我添加了我得到的计划。我注意到我们使用不同的版本,也许这可以解释不同的行为。 - Lennart - Slava Ukraini

0
尝试在days (day, temperature)上创建一个复合覆盖索引。它应该能提高你的速度一些。

@Quassnoi,你能提供一个参考吗? - Ulrich Thomas Gabor
@Quassnoi,我在这个页面上没有看到任何支持你假设的内容?最接近的是InnoDB中,每个辅助索引记录都包含行的主键列,但这里没有辅助索引。我之所以问,是因为我发现只要有复合索引,InnoDB就会优先使用它。如果没有复合索引,它将执行全表扫描。这对你的说法构成了挑战。 - Ulrich Thomas Gabor
@GhostGambler:这不是假设,这就是InnoDB的工作原理。当您在表上定义主键时,InnoDB将其用作聚集索引 - Quassnoi
@是的,但聚集索引中除了主键外没有其他内容。而在复合索引中,所有值都会被考虑进去。 - Ulrich Thomas Gabor
@GhostGambler:你知道什么是聚集索引吗?除了主键之外,聚集索引中包含了所有的内容。 - Quassnoi
显示剩余2条评论

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接