如何在数百万行中优化计数和排序查询

Question

如何在数百万行中优化计数和排序查询

mysqldatabasedatabase-administration

15

需要优化order by和count查询，我的表有数百万行数据（大约300万行）。

我必须连接4个表并获取记录，当我运行简单的查询时，它只需要毫秒就能完成，但是当我尝试使用左连接表进行计数或排序时，它会无限期地卡住。

请查看下面的情况。

数据库服务器配置：

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

每个表中的行数：

tbl_customers -  #Rows: 20 million.
tbl_customers_address -  #Row 25 million.
tbl_shop_setting - #Rows 50k
aio_customer_tracking - #Rows 5k

表结构：

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,   
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
     PRIMARY KEY (`id`),
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

查询正在运行和未运行的案例：

1. Running:  Below query fetch the records by joining all the 4 tables, It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply when try to get the count of these row stuk the query, I waited 10 min but still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20


3. Not running: In the #1 query we simply put the 1 Order by clause and it get stuck, I waited 10 min but still running. I study query optimization some article and tried by indexing, Right Join etc.. but still not working.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

查询解释 #1:

查询解释 #2:

查询解释 #3:

欢迎提出优化查询和表结构的建议。

我的目标：

tbl_customers 表包含客户信息，tbl_customer_address 表包含客户地址（一个客户可能有多个地址），aio_customer_tracking 表包含客户访问记录，其中 last_seen_date 是最近的访问日期。

现在，我想获取并计算客户数量、其中一个地址以及访问信息。同时，我可以根据这三个表中的任何列进行排序。在我的例子中，我按照最近访问日期（默认排序）进行排序。希望这个解释能够帮助您理解我要做什么。

- Irfan.gwb

3

很遗憾，您的查询语句有误，我们无法确定您想要获取的结果，因此我们不能告诉您如何对其进行优化（甚至修复）。例如：select last_seen_date from table group by id会随机返回一个last_seen_date行（请参见例如这里），然后您想要按此排序。在您的第二个查询中，count(distinct x) group by x是多余的（它始终为1，这就是group by的目的），同时您使用left join连接到它也没有效果（但是再次强调，您可能想要查询不同的内容）。 - Solarflare

1

按照 last_seen_date 排序将首先列出 null；2000万客户和（最多）5k个具有 last_seen_date 的客户将使您获得19,995,000行，其中 null 排在第一位（因此为了优化，只需删除 order by）。如果您想获取具有实际 last_seen_date 的行，则查询可能会在不到一秒钟内完成（删除 left join 可能已经可以），但是再次强调，我们并不真正知道您的期望结果是什么，因此在尝试优化之前，请尝试让其正常工作/给出正确的结果。（如果您添加详细信息/示例数据/期望结果，我们可以帮助您解决问题）。 - Solarflare

@Solarflare，感谢您的评论。实际上，aio_customer_tracking表包含了客户的访问日志，我想获取带有他们最后一次访问日期的客户，最近访问的客户排在前面，这就是为令我按照last_seen_date排序的原因。希望这样说得清楚明白。 - Irfan.gwb

请突出显示这些查询之间不同的代码。 - Rick James

请为每个查询提供 EXPLAIN SELECT ...。 - Rick James

显示剩余7条评论

4个回答

4

shopify_customer_id 在 tbl_customers 表中是唯一的，在第二个查询中，为什么要在 shopify_customer_id 列上使用 distinct 和 group by？

请去掉这些。

- mamun0024

LEFT JOIN表格tbl_customers_address可能包含多条关于shopify_customer_id的记录，我需要获取唯一的客户行。因此，我添加了group by和distinct。 - Irfan.gwb

@Irfan.gwb - GROUP BY 的作用类似于 DISTINCT。mamun 建议放弃使用 DISTINCT。 - Rick James

1

如果有多个地址，你想获取哪一个？我的观点是，在每个查询中，GROUP BY 都是“无效的”。 - Rick James

我对任何地址行都没有问题，这个没关系。 - Irfan.gwb

1

查询2存在逻辑错误，正如其他人指出的那样：count(distinct(c.shopify_customer_id))将返回一个单一的值，因此您的分组只会使查询复杂化（这可能确实使MySQL首先按shopify_customer_id进行分组，然后执行count(distinct(shopify_customer_id))，这可能是执行时间较长的原因）。

查询3的排序无法优化，因为您正在连接子选择，该子选择无法索引。需要花费的时间仅为系统需要对结果集进行排序的时间。

解决问题的方法是：

将表tbl_customers_address的索引shopify_customer_id（shopify_customer_id）更改为shopify_customer_id（shopify_customer_id，default），以优化以下查询。
创建一个表格，其中包含查询1（结果）的结果，但不包括 LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id。
修改结果表并添加一列last_seen_date和last_seen_date和shopify_customer_id的索引。
为此查询的结果（last_Date）创建一个表：

SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id

使用来自表last_Date的值更新结果表。

现在，您可以针对结果表运行查询，并使用您创建的索引按last_Date排序。

整个过程所需时间应远少于执行查询2或查询3。

- iLikeMySql

0

你的索引太多了，这会导致插入、更新和删除操作的性能严重下降，并且在某些情况下对查询操作也会有影响，具体取决于优化设置。

此外，请移除 GROUP BY 语句。

对于查询优化，还有很多可以说的事情，比如正确使用聚集索引和非聚集索引，以及如何处理 GROUP BY, ORDER BY, WHERE, 视图等操作。然而，我认为如果你移除一些索引，查询速度就会大幅提升。（当然，你也可以重新编写查询语句，遵循更严格的 SQL 标准并且更合理，但这超出了本问题的范围。）

还有一个问题- 你打算如何处理查询结果？是将其存储并用于查找、计算、自动报告、通过 Web 数据库连接进行显示等吗？这很重要，因为如果你只需要报告/备份或者导出到平面文件中，那么有更有效的方法来获取这些数据。根据你所做的具体任务，有很多不同的选择。

- a lead alcove

删除索引可以稍微提高INSERTs的性能，但对于SELECTs来说可能会成为性能杀手。在MySQL中，我无法想象出“太多索引”会成为“性能杀手”的情况。SELECTs根本不会加速。 - Rick James

完全正确。我应该更具体地说明查询与插入对性能的影响。老实说，我只是浏览了她/他的问题，看到一些我不喜欢的连接和分组操作。大多数回答只涉及CRUD中的R，但我认为值得解决。在某些情况下，过多的索引确实会减慢查询速度，至少在SQL Server中是如此，但这是例外。我应该花更多时间阅读他们的完整问题，并在我的答案中提供更多细节。然而，插入肯定比“稍微”受到更多影响。 - a lead alcove

1

请考虑编辑您的回答以反映您刚才说的内容。 - Rick James

谢谢，我刚刚完成了。 - a lead alcove

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Rick James · Accepted Answer

在查询 #1 中，但不在另外两个查询中，优化器可以使用。

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

为了简化查询，请以下操作：

GROUP BY c.shopify_customer_id
LIMIT 20

这是因为它可以在索引的前20个项目后停止。查询不是超级快，因为派生表（子查询）会命中约51K行。

查询#2可能很慢，只是因为优化器未能注意并删除冗余的DISTINCT。相反，它可能认为它不能在20个之后停止。

查询#3必须完全通过表c来获取每个shopify_customer_id组。这是因为ORDER BY防止了短路以到达LIMIT 20。

GROUP BY中的列必须包括SELECT中的所有非聚合列，除了由GROUP BY列唯一定义的任何列。由于您已经说过一个shopify_customer_id可能有多个地址，所以在连接GROUP BY shopify_customer_id时获取ca.address1是不正确的。同样，子查询似乎与last_seen_date、last_contact_date不符合规范。

在aio_customer_tracking中，将其更改为“覆盖”索引可能会有所帮助：

INDEX (`shopify_customer_id`)

to

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)

拆解目标

现在，我想要……计算客户数

为了计算客户数，请按照以下步骤进行，但不要尝试与“获取”操作结合：

SELECT COUNT(*) FROM tbl_customers;

现在，我只想提取客户的一个地址和访问信息。tbl_customers-#行：2000万。您肯定不想提取2000万行！我不想考虑如何尝试做到这一点。请澄清一下。我不接受浏览那么多行。也许有一个WHERE子句？WHERE子句（通常）是优化的最重要部分！

假设WHERE过滤到了“几个”客户，然后加入另一个表以获取“任何”地址和“任何”访问信息可能会有问题和/或效率低下。要求“第一个”或“最后一个”而不是“任何一个”并不容易，但可能更有意义。

我可以建议您的UI首先找到一些客户，然后如果用户想要，转到另一页，其中包含所有地址和所有访问。或者访问可以达到数百个或更多吗？

让我们专注于优化WHERE，然后将last_seen_date添加到任何索引的末尾。此外，我可以按照这3个表中的任何一列排序，在我的示例中，我按last_seen_date（默认顺序）排序。