如何提高数据分析速度?

3
我需要优化如此庞大的数据集的分析方式,但我不确定下一步该怎么做。我已经拥有了相当数量的MySQL配置调整的经验。
我拥有这个InnoDB表格:
+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int(250)     | NO   | PRI | NULL    | auto_increment |
| memory         | int(15)      | YES  | MUL | NULL    |                |
| q              | varchar(250) | YES  | MUL | NULL    |                |
| created        | datetime     | YES  |     | NULL    |                |
| modified       | datetime     | YES  |     | NULL    |                |
| dt             | datetime     | YES  | MUL | NULL    |                |
| site_id        | int(250)     | NO   | MUL | NULL    |                |
| execution_time | int(11)      | YES  | MUL | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+

以下是10行样例:

+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| id        | memory   | q               | created             | modified            | dt                  | site_id | execution_time |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+
| 266864867 | 38011080 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:44 |     890 |           1534 |
| 266864868 | 46090184 | node/16432      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:46 |     890 |            840 |
| 266864869 | 50329248 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:05:16 |     890 |           2500 |
| 266864870 | 38011272 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:01 |     890 |           1494 |
| 266864871 | 46087732 | node/16432      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:03 |     890 |            850 |
| 266864872 | 30304428 | node/303        | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:12 |     890 |            113 |
| 266864873 | 50329412 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 |     890 |           2465 |
| 266864874 | 28253112 | front_page      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 |     890 |             86 |
| 266864875 | 28256044 | front_page      | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:32 |     890 |             81 |
| 266864876 | 38021072 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:55 |     890 |           1458 |
+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+

以下是表的索引:
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+
| memories |          0 | PRIMARY              |            1 | id             | A         |     8473766 |     NULL | NULL   |      | BTREE      |         |
| memories |          1 | index_dt             |            1 | dt             | A         |     1210538 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_execution_time |            1 | execution_time | A         |        2344 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_memory         |            1 | memory         | A         |     8473766 |     NULL | NULL   | YES  | BTREE      |         |
| memories |          1 | index_site_id        |            1 | site_id        | A         |          16 |     NULL | NULL   |      | BTREE      |         |
| memories |          1 | index_q              |            1 | q              | A         |      338950 |     NULL | NULL   | YES  | BTREE      |         |
+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+

它存储了许多不同站点(site_id)的一百万条记录。对于给定的站点,可能有20,000行记录。存储的信息是每个页面请求的性能指标。如果重要的话,请注意以下非常规字段:memory字段表示脚本使用的内存量,q是路径,site_id是Sites表的引用。

我在这些数据上运行了两个较慢的查询。第一个查询获取了最耗费内存的25个页面:

Select 
  Memory.q, count(*) as count, 
  AVG(Memory.memory) as average_memory, 
  MAX(Memory.memory) as peak_memory,
  AVG(Memory.execution_time) as average_execution_time,
  MAX(Memory.execution_time) as peak_execution_time 
FROM Memory 
WHERE site_id = $some_site_id 
ORDER BY average_memory DESC 
GROUP BY Memory.q
LIMIT 25

第二个查询获取给定站点的最慢的平均25页:
Select 
  Memory.q, count(*) as count, 
  AVG(Memory.memory) as average_memory, 
  MAX(Memory.memory) as peak_memory,
  AVG(Memory.execution_time) as average_execution_time,
  MAX(Memory.execution_time) as peak_execution_time 
FROM Memory 
WHERE site_id = $some_site_id 
ORDER BY average_execution_time DESC 
GROUP BY Memory.q
LIMIT 25

我最近将表从MyISAM转换为InnoDB,这样这些读操作就不会锁定表格。这导致更新此表的操作排队并滞后。
除了向问题投入更多内存(以增加InnoDB缓存大小)之外,我想看看是否还有其他选择。我从未使用过NoSQL数据库,但据我所知,它们在这里并没有太大帮助,因为我使用聚合函数和查询。
如果有更好的方法来处理这些数据的存储和分析,请提出您的想法。
更新:
对查询进行分析显示慢的部分都在复制到临时表中。我将研究如何使此步骤更快。
+--------------------------------+----------+
| Status                         | Duration |
+--------------------------------+----------+
| starting                       | 0.000030 |
| checking query cache for query | 0.000065 |
| Opening tables                 | 0.000013 |
| System lock                    | 0.000004 |
| Table lock                     | 0.000014 |
| init                           | 0.000032 |
| optimizing                     | 0.000010 |
| statistics                     | 0.008119 |
| preparing                      | 0.000042 |
| Creating tmp table             | 0.000317 |
| executing                      | 0.000005 |
| Copying to tmp table           | 5.349280 |
| Sorting result                 | 0.006511 |
| Sending data                   | 0.000092 |
| end                            | 0.000005 |
| removing tmp table             | 0.001510 |
| end                            | 0.000007 |
| query end                      | 0.000004 |
| freeing items                  | 0.001163 |
| logging slow query             | 0.000006 |
| cleaning up                    | 0.000006 |
+--------------------------------+----------+
21 rows in set (0.01 sec)

mysql> show profile cpu for query 4;
+--------------------------------+----------+----------+------------+
| Status                         | Duration | CPU_user | CPU_system |
+--------------------------------+----------+----------+------------+
| starting                       | 0.000030 | 0.000000 |   0.000000 |
| checking query cache for query | 0.000065 | 0.000000 |   0.000000 |
| Opening tables                 | 0.000013 | 0.000000 |   0.000000 |
| System lock                    | 0.000004 | 0.000000 |   0.000000 |
| Table lock                     | 0.000014 | 0.000000 |   0.000000 |
| init                           | 0.000032 | 0.000000 |   0.000000 |
| optimizing                     | 0.000010 | 0.000000 |   0.000000 |
| statistics                     | 0.008119 | 0.001000 |   0.000000 |
| preparing                      | 0.000042 | 0.000000 |   0.000000 |
| Creating tmp table             | 0.000317 | 0.000000 |   0.000000 |
| executing                      | 0.000005 | 0.000000 |   0.000000 |
| Copying to tmp table           | 5.349280 | 0.687896 |   0.412937 |
| Sorting result                 | 0.006511 | 0.004999 |   0.001999 |
| Sending data                   | 0.000092 | 0.000000 |   0.000000 |
| end                            | 0.000005 | 0.000000 |   0.000000 |
| removing tmp table             | 0.001510 | 0.000000 |   0.001000 |
| end                            | 0.000007 | 0.000000 |   0.000000 |
| query end                      | 0.000004 | 0.000000 |   0.000000 |
| freeing items                  | 0.001163 | 0.000000 |   0.001000 |
| logging slow query             | 0.000006 | 0.000000 |   0.000000 |
| cleaning up                    | 0.000006 | 0.000000 |   0.000000 |
+--------------------------------+----------+----------+------------+

计算需要实时完成吗?您是否可以定期转储数据,然后离线处理它? - robbrit
谢谢您的询问。这些数据不需要实时运行。我已经有一个每小时处理程序,可以在1/24的站点上进行此计算,并将结果插入报告表中,这就是应用程序实时读取的内容。然而,这个滚动过程正在拖累系统。 - Justin
你可以使用MySQL创建一个“从”数据库,它是一个独立于主数据库但克隆数据的数据库。然后,您可以在从数据库上运行查询而不影响主数据库。 - robbrit
@robbrit - 谢谢,我以前设置过从节点。我希望有一种避免在问题上投入更多硬件的方法。 - Justin
在这个点上,根据mysql调谐器输出,我认为我可能只需要更多的内存。 - Justin
7个回答

4

虽然您没有展示键结构,但是可以看出site_id是多部分键(MUL)的一部分。请注意,如果它不是该多部分键中的第一个字段,则该键不能用于where子句。例如:

KEY somekey (field1, site_id, field3, ...)

如果要使用查询中的键,则where子句必须同时包括“field”和“site_id”。您不必按照键中列出的顺序使用字段(where site_id=.. and field1=...where field1=... and site_id=...相同),但是由于field1在键的定义中出现在site_id之前,因此您必须同时使用它才能使用整个键。对于您的q字段也是如此。它必须是被覆盖的键中的第一个,否则这些键将无法使用。

谢谢。你能解释一下为什么这个答案是相关的吗?我的意思是,这些查询工作得很好,只是慢了一点。也许我表达不够清楚。 - Justin
慢速的一个原因是缺少键或键配置错误。这在大型数据集上变得非常相关,当MySQL必须扫描整个表(可能是重复的)来完成其工作时,而有可用的索引将大大减少所需的时间。 - Marc B
谢谢!这个表确实有索引,但我对索引不是很熟悉,所以它们可能是错误的。我已经将它们添加到问题中了。它们看起来正确吗? - Justin

3
为了有效地设计InnoDB表,您需要了解InnoDB如何使用索引-特别是聚簇索引以及它们的工作原理。
背景阅读
请花些时间阅读以下文章和我的以前的答案: 您还可能会发现这个演示很有趣: 现在您对innodb架构有了更好的理解,我们来看一下如何优化您的模型以适用于innodb引擎。
由于您只提供了两个示例查询,我必须做出一些假设,因此以下设计是针对覆盖site_id和path的查询进行优化的。如果需要,我将让您进一步修改设计,因为您比我更了解您的数据。
修订后的模式(简化版)
我修改了您的设计并创建了3个表:site、site_request和site_request_metric。
Site表(1024行)
drop table if exists site;
create table site
(
site_id smallint unsigned not null auto_increment primary key,
url varchar(255) unique not null,
next_request_id int unsigned not null default 0
)
engine=innodb;

select count(*) from site;
+----------+
| count(*) |
+----------+
|     1024 |
+----------+

网站表 - 示例数据

+---------+------------------+-----------------+
| site_id | url              | next_request_id |
+---------+------------------+-----------------+
|       1 | www.site1.com    |             167 |
|       2 | www.site2.com    |             177 |
|       3 | www.site3.com    |              68 |
...
|    1022 | www.site1022.com |             203 |
|    1023 | www.site1023.com |              80 |
|    1024 | www.site1024.com |             239 |
+---------+------------------+-----------------+

大部分字段都是自解释的,但next_request_id是一个计数器字段,记录了一个网站有多少个请求(在你的示例中为路径或q)。例如,网站1024有239个单独的页面请求/路径,我们希望记录内存和执行指标。
此外,请注意我使用的数字数据类型——你的大多数数据类型定义不清,似乎混淆了可选的显示宽度说明符(仅与zerofill一起使用)和整数大小。选择尽可能小的数据类型非常重要,这样我们可以在innodb缓冲区中打包更多的数据。

http://dev.mysql.com/doc/refman/5.0/en/integer-types.html

网站请求表(192K行)

drop table if exists site_request;
create table site_request
(
site_id smallint unsigned not null,
request_id int unsigned not null,
created_date datetime not null,
path varchar(255) not null,
next_metric_id int unsigned not null default 0,
primary key (site_id, request_id)
)
engine=innodb;

select count(*) from site_request;
+----------+
| count(*) |
+----------+
|   192336 |
+----------+

网站请求表 - 示例数据

+---------+------------+---------------------+----------------------+----------------+
| site_id | request_id | created_date        | path                 | next_metric_id |
+---------+------------+---------------------+----------------------+----------------+
|       1 |          1 | 2011-12-14 17:17:41 | www.site1.com/1      |            250 |
|       1 |          2 | 2011-12-14 17:17:41 | www.site1.com/2      |            132 |
|       1 |          3 | 2011-12-14 17:17:41 | www.site1.com/3      |            345 |
...
|       1 |         166| 2011-12-14 17:17:41 | www.site1.com/166    |            342 |
|       1 |         167| 2011-12-14 17:17:41 | www.site1.com/167    |            231 |
...
|    1024 |          1 | 2011-12-14 17:17:58 | www.site1024.com/1   |            241 |
|    1024 |          2 | 2011-12-14 17:17:58 | www.site1024.com/2   |            266 |
...
|    1024 |        236 | 2011-12-14 17:17:58 | www.site1024.com/236 |            466 |
|    1024 |        237 | 2011-12-14 17:17:58 | www.site1024.com/237 |            459 |
|    1024 |        238 | 2011-12-14 17:17:58 | www.site1024.com/238 |            389 |
|    1024 |        239 | 2011-12-14 17:17:58 | www.site1024.com/239 |            592 |
+---------+------------+---------------------+----------------------+----------------+

大部分字段都是自解释的。该表的主键是由site_id和request_id组成的复合主键,因此站点1有167个单独的请求/路径,站点1024有239个。

要选择一个单独的请求,您必须指定site_id和request_id:

select * from site_request where site_id = 1 and request_id = 167
+---------+------------+---------------------+-------------------+----------------+
| site_id | request_id | created_date        | path              | next_metric_id |
+---------+------------+---------------------+-------------------+----------------+
|       1 |        167 | 2011-12-14 17:17:41 | www.site1.com/167 |            231 |
+---------+------------+---------------------+-------------------+----------------+
1 row in set (0.00 sec)

select * from site_request where site_id = 1024 and request_id = 167
+---------+------------+---------------------+----------------------+----------------+
| site_id | request_id | created_date        | path                 | next_metric_id |
+---------+------------+---------------------+----------------------+----------------+
|    1024 |        167 | 2011-12-14 17:17:58 | www.site1024.com/167 |            175 |
+---------+------------+---------------------+----------------------+----------------+
1 row in set (0.00 sec)

如果我想向一个网站添加新的请求,我们使用site.next_request_id + 1来生成给定site_id的下一个复合主键值。通常使用触发器完成如下操作:
delimiter #

create trigger site_request_before_ins_trig before insert on site_request
for each row
begin
declare v_id int unsigned default 0;

  select next_request_id + 1 into v_id from site where site_id = new.site_id;
  set new.request_id = v_id, new.created_date = now();
  update site set next_request_id = v_id where site_id = new.site_id;
end#

delimiter ;

为什么我不只是创建一个自动增量主键和一个站点ID的二级索引呢?

create table site_request
(
request_id int unsigned not null auto_increment primary key,
site_id smallint unsigned not null,
...
key (site_id)
)
engine=innodb;

我假设大部分查询都涉及site_id和path,因此在site_id上对请求表进行聚类是一种值得的优化,即使插入开销会略微增加。我更关心读取性能,特别是这个表将与庞大的指标表后续连接。

网站请求指标表(7400万行)

drop table if exists site_request_metric;
create table site_request_metric
(
site_id smallint unsigned not null,
request_id int unsigned not null,
metric_id int unsigned not null,
created_date datetime not null,
memory_usage int unsigned not null default 0,
execution_time mediumint unsigned not null default 0,
primary key (site_id, request_id, metric_id)
)
engine=innodb;

select count(*) from site_request_metric;
+----------+
| count(*) |
+----------+
| 73858764 |
+----------+

网站请求度量表 - 示例数据

+---------+------------+-----------+---------------------+--------------+----------------+
| site_id | request_id | metric_id | created_date        | memory_usage | execution_time |
+---------+------------+-----------+---------------------+--------------+----------------+
|       1 |          1 |         1 | 2011-12-14 17:17:58 |     18052380 |       7731 |
|       1 |          1 |         2 | 2011-12-14 17:17:58 |     32013204 |       7881 |
|       1 |          1 |         3 | 2011-12-14 17:17:58 |     55779470 |       7274 |
...
|       1 |          1 |       249 | 2011-12-14 17:17:58 |     11527748 |       5126 |
|       1 |          1 |       248 | 2011-12-14 17:17:58 |     19457506 |       4097 |
|       1 |          1 |       247 | 2011-12-14 17:17:58 |     23129432 |       6202 |
...
|     997 |          1 |         1 | 2011-12-14 19:08:48 |     38584043 |       7156 |
|     997 |          1 |         2 | 2011-12-14 19:08:48 |     68884314 |       2185 |
|     997 |          1 |         3 | 2011-12-14 19:08:48 |     31545597 |        207 |
...
|     997 |          1 |       380 | 2011-12-14 19:08:49 |     39123978 |        166 |
|     997 |          1 |       381 | 2011-12-14 19:08:49 |     45114404 |       7310 |
|     997 |          1 |       382 | 2011-12-14 19:08:49 |     55057884 |        506 |    +---------+------------+-----------+---------------------+--------------+----------------+

site_request_metric.next_metric_id字段的工作方式类似于site.next_request_id计数器字段,并使用触发器进行维护。

delimiter #

create trigger site_request_metric_before_ins_trig before insert on site_request_metric
for each row
begin
declare v_id int unsigned default 0;

  select next_metric_id + 1 into v_id from site_request where site_id = new.site_id and request_id = new.request_id;
  set new.metric_id = v_id, new.created_date = now();
  update site_request set next_metric_id = v_id where site_id = new.site_id and request_id = new.request_id;
end#

delimiter ;

模式性能

以997网站为例:

select * from site where site_id = 997;
+---------+-----------------+-----------------+
| site_id | url             | next_request_id |
+---------+-----------------+-----------------+
|     997 | www.site997.com |             319 |
+---------+-----------------+-----------------+
1 row in set (0.00 sec)

网站997有319个单独的页面请求/路径。

select * from site_request where site_id = 997;
+---------+------------+---------------------+---------------------+----------------+
| site_id | request_id | created_date        | path                | next_metric_id |
+---------+------------+---------------------+---------------------+----------------+
|     997 |          1 | 2011-12-14 17:17:58 | www.site997.com/1   |            383 |
|     997 |          2 | 2011-12-14 17:17:58 | www.site997.com/2   |            262 |
|     997 |          3 | 2011-12-14 17:17:58 | www.site997.com/3   |            470 |
|     997 |          4 | 2011-12-14 17:17:58 | www.site997.com/4   |            247 |
...
|     997 |        316 | 2011-12-14 17:17:58 | www.site997.com/316 |            176 |
|     997 |        317 | 2011-12-14 17:17:58 | www.site997.com/317 |            441 |
|     997 |        318 | 2011-12-14 17:17:58 | www.site997.com/318 |            419 |
|     997 |        319 | 2011-12-14 17:17:58 | www.site997.com/319 |            601 |
+---------+------------+---------------------+---------------------+----------------+
319 rows in set (0.00 sec)

我们有多少度量衡可以用于997站点的所有请求?
select sum(next_metric_id) from site_request where site_id = 997;
+---------------------+
| sum(next_metric_id) |
+---------------------+
|              130163 |
+---------------------+
1 row in set (0.00 sec)

对于这个站点,将next_metric_id相加(如上所述)比通常的方法更快:

select count(*) from site_request_metric where site_id = 997;
+----------+
| count(*) |
+----------+
|   130163 |
+----------+
1 row in set (0.03 sec)

所以,站点997有大约130K的内存和执行时间指标需要在大约7400万行的表中进行分析。

接下来让我们尝试主要查询:(所有运行时间都是冷启动,即mysql重新启动,空缓存并且没有查询缓存!)

内存

select
 hog.*,
 sr.path
from
(
select 
 srm.site_id,
 srm.request_id,
 count(*) as counter, 
 avg(srm.memory_usage) as average_memory, 
 max(srm.memory_usage) as peak_memory,
 avg(srm.execution_time) as average_execution_time,
 max(srm.execution_time) as peak_execution_time 
from
 site_request_metric srm
where
 srm.site_id = 997
group by 
 srm.site_id,
 srm.request_id
order by
 average_memory desc
limit 25
) hog
inner join site_request sr on hog.site_id = sr.site_id and hog.request_id = sr.request_id;

结果如下:
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
| site_id | request_id | counter | average_memory | peak_memory | average_execution_time | peak_execution_time | path                |
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
|     997 |        103 |     184 |  43381803.4293 |    69682361 |              4378.1630 |                8069 | www.site997.com/103 |
|     997 |        151 |     158 |  42594703.1392 |    69329761 |              4422.8481 |                8080 | www.site997.com/151 |
|     997 |        192 |     509 |  42470135.3360 |    69927112 |              4083.1198 |                8098 | www.site997.com/192 |
|     997 |        248 |     161 |  42169276.5590 |    69995565 |              4118.1180 |                7949 | www.site997.com/248 |
|     997 |        221 |     162 |  42156708.4877 |    69233026 |              4151.1667 |                8022 | www.site997.com/221 |
|     997 |        136 |     154 |  42026979.3831 |    69897045 |              4060.5649 |                8098 | www.site997.com/136 |
|     997 |        239 |     424 |  41979697.9788 |    69381215 |              4463.0189 |                8087 | www.site997.com/239 |
|     997 |         77 |     338 |  41864013.0266 |    69991164 |              3942.4142 |                8067 | www.site997.com/77  |
|     997 |        283 |     249 |  41853642.9157 |    69945794 |              3915.7028 |                8034 | www.site997.com/283 |
|     997 |          5 |     228 |  41815274.7851 |    69825743 |              3898.4123 |                8078 | www.site997.com/5   |
|     997 |        216 |     319 |  41766464.5078 |    69777901 |              3899.0752 |                8091 | www.site997.com/216 |
|     997 |        131 |     170 |  41720890.5118 |    69892577 |              4074.2588 |                8097 | www.site997.com/131 |
|     997 |        160 |     385 |  41702556.6545 |    69868379 |              4060.2727 |                8093 | www.site997.com/160 |
|     997 |        245 |     200 |  41683505.3900 |    69668739 |              4052.7950 |                8095 | www.site997.com/245 |
|     997 |         70 |     429 |  41640396.0466 |    69988619 |              3995.3310 |                8099 | www.site997.com/70  |
|     997 |         98 |     485 |  41553544.7649 |    69957698 |              4048.1443 |                8096 | www.site997.com/98  |
|     997 |        153 |     301 |  41542909.4651 |    69754024 |              3884.7409 |                8028 | www.site997.com/153 |
|     997 |        226 |     429 |  41523530.3939 |    69691453 |              4097.7226 |                8096 | www.site997.com/226 |
|     997 |         31 |     478 |  41442100.4435 |    69802248 |              3999.3096 |                8098 | www.site997.com/31  |
|     997 |        171 |     222 |  41405805.8153 |    69433643 |              4364.4414 |                8087 | www.site997.com/171 |
|     997 |        150 |     336 |  41393538.5744 |    69746950 |              4264.5655 |                8077 | www.site997.com/150 |
|     997 |        167 |     526 |  41391595.5741 |    69633242 |              4206.1597 |                8096 | www.site997.com/167 |
|     997 |        182 |     593 |  41288151.5379 |    69992913 |              4351.6476 |                8099 | www.site997.com/182 |
|     997 |         14 |     555 |  41239680.5387 |    69976632 |              4054.6126 |                8084 | www.site997.com/14  |
|     997 |        297 |     410 |  41163572.3805 |    69874576 |              4001.0829 |                8039 | www.site997.com/297 |
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
25 rows in set (0.41 sec)

执行时间

select
 hog.*,
 sr.path
from
(
select 
 srm.site_id,
 srm.request_id,
 count(*) as counter, 
 avg(srm.memory_usage) as average_memory, 
 max(srm.memory_usage) as peak_memory,
 avg(srm.execution_time) as average_execution_time,
 max(srm.execution_time) as peak_execution_time 
from
 site_request_metric srm
where
 srm.site_id = 997
group by 
 srm.site_id,
 srm.request_id
order by
 average_execution_time desc
limit 25
) hog
inner join site_request sr on hog.site_id = sr.site_id and hog.request_id = sr.request_id;

结果如下:
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
| site_id | request_id | counter | average_memory | peak_memory | average_execution_time | peak_execution_time | path                |
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
|     997 |        213 |     159 |  37962517.1321 |    67120491 |              4497.9119 |                8055 | www.site997.com/213 |
|     997 |        239 |     424 |  41979697.9788 |    69381215 |              4463.0189 |                8087 | www.site997.com/239 |
|     997 |        151 |     158 |  42594703.1392 |    69329761 |              4422.8481 |                8080 | www.site997.com/151 |
|     997 |        289 |     382 |  39227749.9869 |    69715783 |              4402.8927 |                8093 | www.site997.com/289 |
|     997 |         69 |     473 |  40099817.4715 |    69798587 |              4380.6850 |                8092 | www.site997.com/69  |
|     997 |        103 |     184 |  43381803.4293 |    69682361 |              4378.1630 |                8069 | www.site997.com/103 |
|     997 |        183 |     236 |  40111564.1356 |    69853507 |              4376.4280 |                8032 | www.site997.com/183 |
|     997 |        171 |     222 |  41405805.8153 |    69433643 |              4364.4414 |                8087 | www.site997.com/171 |
|     997 |         58 |     212 |  39289163.9057 |    69861740 |              4355.8396 |                8087 | www.site997.com/58  |
|     997 |         71 |     388 |  39895200.6108 |    69801188 |              4353.9639 |                8086 | www.site997.com/71  |
|     997 |        182 |     593 |  41288151.5379 |    69992913 |              4351.6476 |                8099 | www.site997.com/182 |
|     997 |        195 |     305 |  39780792.6066 |    69824981 |              4343.0295 |                8081 | www.site997.com/195 |
|     997 |        318 |     419 |  39860696.4415 |    69958266 |              4323.6420 |                8071 | www.site997.com/318 |
|     997 |        303 |     318 |  39357663.3899 |    69850523 |              4322.4686 |                8097 | www.site997.com/303 |
|     997 |        198 |     306 |  38990104.1699 |    69851817 |              4320.0621 |                8088 | www.site997.com/198 |
|     997 |        286 |     227 |  39654671.5859 |    69871305 |              4307.8811 |                8055 | www.site997.com/286 |
|     997 |        105 |     611 |  39055749.5008 |    69813117 |              4296.0802 |                8090 | www.site997.com/105 |
|     997 |        298 |     388 |  40150371.2474 |    69985665 |              4286.9716 |                8095 | www.site997.com/298 |
|     997 |         84 |     517 |  39520438.9497 |    69990404 |              4283.3578 |                8098 | www.site997.com/84  |
|     997 |        106 |     448 |  41099495.4018 |    69902616 |              4282.6094 |                8082 | www.site997.com/106 |
|     997 |        237 |     431 |  39017341.3387 |    69623443 |              4277.4872 |                8071 | www.site997.com/237 |
|     997 |         55 |     381 |  39603109.8294 |    69750984 |              4269.1969 |                8095 | www.site997.com/55  |
|     997 |         34 |     438 |  40697744.4087 |    69843517 |              4266.3288 |                8047 | www.site997.com/34  |
|     997 |         38 |     433 |  40169799.8291 |    69898182 |              4266.1663 |                8088 | www.site997.com/38  |
|     997 |        150 |     336 |  41393538.5744 |    69746950 |              4264.5655 |                8077 | www.site997.com/150 |
+---------+------------+---------+----------------+-------------+------------------------+---------------------+---------------------+
25 rows in set (0.30 sec)

这两个查询在包含约7400万行的表中的冷运行时间都不到0.5秒(后续运行时间约为0.06秒),但这并不是一个明确的答案,因为还有许多其他因素可能会影响表和索引设计,而我没有考虑。然而,这应该能让你了解如何通过简单的表/索引设计显著提高innodb查询性能。希望这可以帮到你:)
完整脚本在这里:http://pastie.org/3022142

哇!真是太棒了……非常感谢您提供这些信息。我会仔细研究的。 - Justin

2
我建议先使用内置分析器来分析查询。
mysql> SET profiling = 1;
mysql> <your query>;
mysql> SHOW PROFILES;
mysql> SHOW PROFILE FOR QUERY <id of your query>;
mysql> SHOW PROFILE CPU FOR QUERY <id of your query>;

请注意,性能分析不是免费的,因此只有在网站可以处理它时,才应该执行它,可能在实时系统的副本上执行。

这是一个很棒的想法!我已经添加了输出。谢谢。 - Justin

0

我会添加另一个字段,其中包含“q”的MD5哈希,并使用该字段的值进行分组。

在varchar(250)上建立索引并按字段值进行分组不是一个好主意。

您需要在(site_id,q_hash)上创建复合索引。


0

如果我正确地理解了你的问题(和评论),那么问题在于这些查询使系统变得缓慢。

其他答案为优化指出了好的方向(修复索引,使用分析器等)。

另一个策略是设置复制,并对从服务器运行这些重型查询。主服务器将保持稳定,写入二进制日志,而从服务器将在查询完成后追赶上来。这种设置允许您使用长时间运行的查询来攻击从服务器,而不影响主服务器的写入性能。


0
你真正需要的是两个良好的索引来支持你提出的查询。
你目前拥有的索引不足,因为数据仍将与 MySQL 查询优化器选择的任何索引一起从表中检索。
@MarkB的答案从理论上讲是你想要的(+1给@MarkB)。你只需要使索引符合任何给定查询的标准:
  1. WHERE子句
  2. ORDER BY 子句
  3. GROUP BY 子句
  4. 必要列(不在 WHEREORDER BYGROUP BY 中)
让我们来看看你的第一个查询:
Select  
  Memory.q, count(*) as count,  
  AVG(Memory.memory) as average_memory,  
  MAX(Memory.memory) as peak_memory, 
  AVG(Memory.execution_time) as average_execution_time, 
  MAX(Memory.execution_time) as peak_execution_time  
FROM Memory  
WHERE site_id = $some_site_id  
ORDER BY average_memory DESC  
GROUP BY Memory.q 
LIMIT 25 

请看以下四个标准:

  • WHERE 只有一个值,[site_id]
  • ORDER BY 将在 WHERE 内排序,[average_memory]
  • GROUP BY 将在 ORDER BY 内排序,[q]
  • 必要的列:[memory],[execution_time]

方括号内的内容是按照所示顺序放置索引的内容。这是索引:

ALTER TABLE Memory ADD INDEX siteid_q_mem_exectime_index
(site_id,q,memory,execution_time);

请注意,average_memory 不是表列。 它是从 memory 字段派生出来的。
现在,对第二个查询执行相同的操作:
Select       
  Memory.q, count(*) as count,       
  AVG(Memory.memory) as average_memory,       
  MAX(Memory.memory) as peak_memory,      
  AVG(Memory.execution_time) as average_execution_time,      
  MAX(Memory.execution_time) as peak_execution_time       
FROM Memory       
WHERE site_id = $some_site_id       
ORDER BY average_execution_time DESC       
GROUP BY Memory.q      
LIMIT 25      

看一下这四个标准:

  • WHERE 只有一个值,[site_id]
  • ORDER BY 将在 WHERE 内排序,[average_execution]
  • GROUP BY 将在 ORDER BY 内排序,[q]
  • 必要的列: [memory],[execution_time]

结果将与之前相同。因此,您不需要另一个索引。

这里是它:

ALTER TABLE Memory ADD INDEX siteid_q_mem_exectime_index
(site_id,q,memory,execution_time);

为什么这个索引如此重要?

  • ORDER BYGROUP BY 通常会触发临时表的内部排序操作。如果表被正确地索引,当遍历索引时,数据已按需要排序。
  • 所需列(memory,**execution_time**)在索引中有很好的原因。如果索引包含结果集所需的每一列,MySQL 将不会访问表。它只会从索引中读取所需的数据。这将减少磁盘 I/O。

以这种方式创建的索引称为“覆盖索引”。

以下是一些关于此主题的好链接。享受吧!


-1
首先,从我看到的情况来看,你必须避免使用GROUP BY - 它会占用大量内存。将其拆分为两个查询。同时,按照Marc B的建议创建索引。

分组是数据分析所必需的。 - Justin

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接