MySQL subquery with GROUP BY in a LEFT JOIN - optimization

3
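MySQL seems unable to optimize a SELECT that joins against a GROUP BY subquery, which leads to long execution times. There must be a known optimization for such a common scenario.
Suppose we want to return all orders from the database, with a flag indicating whether each one is the customer's first order.
-- `order` is a reserved word in MySQL, so it has to be quoted with backticks
CREATE TABLE orders (`order` int, customer int, date date);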

Retrieving the first order per customer is very fast.

SELECT customer, min(`order`) as first_order FROM orders GROUP BY customer;

However, it becomes very slow as soon as we join that subquery against the full set of orders.
SELECT `order`, first_order FROM orders LEFT JOIN ( 
  SELECT customer, min(`order`) as first_order FROM orders GROUP BY customer
) AS first_orders ON orders.`order`=first_orders.first_order;

I hope there is a simple trick we have missed, because otherwise the following workaround is about 1000 times faster:

CREATE TEMPORARY TABLE tmp_first_order AS 
  SELECT customer, min(`order`) as first_order FROM orders GROUP BY customer;
CREATE INDEX tmp_boost ON tmp_first_order (first_order);

SELECT `order`, first_order FROM orders LEFT JOIN tmp_first_order 
  ON orders.`order`=tmp_first_order.first_order;
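
As a quick sanity check that the temporary-table workaround returns the same rows as the derived-table join, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The sample rows and the function names (`setup`, `via_derived_table`, `via_temp_table`) are assumptions made for illustration, and SQLite's planner differs from MySQL's, so this only compares results and says nothing about timings.

```python
import sqlite3

# Hypothetical sample data as (order, customer) pairs; not from the original post.
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4)]


def setup():
    """Create an in-memory orders table (the date column is omitted for brevity)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (`order` INT, customer INT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", ROWS)
    return conn


def via_derived_table(conn):
    """The slow form from the question: LEFT JOIN against a GROUP BY subquery."""
    return conn.execute("""
        SELECT `order`, first_order FROM orders LEFT JOIN (
          SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer
        ) AS first_orders ON orders.`order` = first_orders.first_order
        ORDER BY `order`""").fetchall()


def via_temp_table(conn):
    """The workaround: materialize the aggregate, index it, then join."""
    conn.execute("""
        CREATE TEMPORARY TABLE tmp_first_order AS
          SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer""")
    conn.execute("CREATE INDEX tmp_boost ON tmp_first_order (first_order)")
    return conn.execute("""
        SELECT `order`, first_order FROM orders LEFT JOIN tmp_first_order
          ON orders.`order` = tmp_first_order.first_order
        ORDER BY `order`""").fetchall()


if __name__ == "__main__":
    conn = setup()
    print(via_derived_table(conn) == via_temp_table(conn))
```

Both functions add an `ORDER BY` only so that the two result lists can be compared directly.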

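Edit
Inspired by the third option proposed by @ruakh, there is indeed a less ugly workaround using INNER JOIN and UNION that performs acceptably and needs no temporary table. However, it is somewhat specific to our case, and I would like to know whether a more generic optimization exists.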

SELECT `order`, "YES" as first FROM orders INNER JOIN ( 
    SELECT min(`order`) as first_order FROM orders GROUP BY customer
  ) AS first_orders_1 ON orders.`order`=first_orders_1.first_order
UNION
SELECT `order`, "NO" as first FROM orders INNER JOIN ( 
    SELECT customer, min(`order`) as first_order FROM orders GROUP BY customer
  ) AS first_orders_2 ON first_orders_2.customer = orders.customer 
    AND orders.`order` > first_orders_2.first_order;
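
To illustrate what the UNION query produces, the sketch below runs it on a small, made-up data set with Python's `sqlite3` module (a stand-in for MySQL, so it shows the result shape only, not performance). The sample rows and the `label_orders` helper are assumptions, the string literals use single quotes for portability, and `ORDER BY 1` is added only to make the output deterministic.

```python
import sqlite3

# Hypothetical sample data as (order, customer) pairs; not from the original post.
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4)]

UNION_QUERY = """
SELECT `order`, 'YES' AS `first` FROM orders INNER JOIN (
    SELECT MIN(`order`) AS first_order FROM orders GROUP BY customer
  ) AS first_orders_1 ON orders.`order` = first_orders_1.first_order
UNION
SELECT `order`, 'NO' AS `first` FROM orders INNER JOIN (
    SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer
  ) AS first_orders_2 ON first_orders_2.customer = orders.customer
    AND orders.`order` > first_orders_2.first_order
ORDER BY 1
"""


def label_orders():
    """Return (order, 'YES'/'NO') pairs for the sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (`order` INT, customer INT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", ROWS)
    return conn.execute(UNION_QUERY).fetchall()


if __name__ == "__main__":
    for order_id, is_first in label_orders():
        print(order_id, is_first)
```

Every order lands in exactly one branch: the first branch matches each customer's minimum order, and the second matches everything after it. Since the two branches cannot overlap, `UNION ALL` would probably give the same rows without the duplicate-elimination step, provided `order` values are unique.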

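A few ideas: analyze the execution plan (EXPLAIN the query); add an index; use a subquery instead of the LEFT JOIN. - full.stack.ex
Kristox, did you check my answer? - Peter Lang
2 Answers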

3
Here are a few things you could try:
  1. Removing customer from the subquery's field-list, since it's not doing anything anyway:

    SELECT `order`,
           first_order
      FROM orders
      LEFT
      JOIN ( SELECT MIN(`order`) AS first_order
               FROM orders
              GROUP
                 BY customer
           ) AS first_orders
        ON orders.`order` = first_orders.first_order
    ;
    
  2. Conversely, adding customer to the ON clause, so it actually does something for you:

    SELECT `order`,
           first_order
      FROM orders
      LEFT
      JOIN ( SELECT customer,
                    MIN(`order`) AS first_order
               FROM orders
              GROUP
                 BY customer
           ) AS first_orders
        ON orders.customer = first_orders.customer
       AND orders.`order` = first_orders.first_order
    ;
    
  3. Same as previous, but using an INNER JOIN instead of a LEFT JOIN, and converting your original ON clause into a CASE expression:

    SELECT `order`,
           CASE WHEN first_order = `order` THEN first_order END AS first_order
      FROM orders
     INNER
      JOIN ( SELECT customer,
                    MIN(`order`) AS first_order
               FROM orders
              GROUP
                 BY customer
           ) AS first_orders
        ON orders.customer = first_orders.customer
    ;
    
  4. Replacing the whole JOIN approach with an uncorrelated IN-subquery in a CASE expression:

    SELECT `order`,
           CASE WHEN `order` IN
                      ( SELECT MIN(`order`)
                          FROM orders
                         GROUP
                            BY customer
                      )
                THEN `order`
            END AS first_order
      FROM orders
    ;
    
  5. Replacing the whole JOIN approach with a correlated EXISTS-subquery in a CASE expression:

    SELECT `order`,
           CASE WHEN NOT EXISTS
                      ( SELECT 1
                          FROM orders AS o2
                         WHERE o2.customer = o1.customer
                           AND o2.`order` < o1.`order`
                      )
                THEN `order`
            END AS first_order
      FROM orders AS o1
    ;
    
(It's quite possible that some of the above will actually perform worse, but I think they're all worth a try.)
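
Before timing the five variants, it may be worth confirming that they agree on the result. The following is a rough sketch that runs all of them on a small, made-up data set with Python's `sqlite3` module. SQLite stands in for MySQL here, so it checks equivalence only. The sample rows, the `QUERIES` dictionary and the `run_all` helper are assumptions for illustration, and each query carries an extra `ORDER BY` so the outputs are comparable.

```python
import sqlite3

# Hypothetical sample data as (order, customer) pairs; not from the original post.
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4)]

QUERIES = {
    # 1. customer dropped from the subquery's field list
    "option 1": """
        SELECT `order`, first_order FROM orders
        LEFT JOIN (SELECT MIN(`order`) AS first_order
                     FROM orders GROUP BY customer) AS first_orders
          ON orders.`order` = first_orders.first_order
        ORDER BY `order`""",
    # 2. customer added to the ON clause
    "option 2": """
        SELECT `order`, first_order FROM orders
        LEFT JOIN (SELECT customer, MIN(`order`) AS first_order
                     FROM orders GROUP BY customer) AS first_orders
          ON orders.customer = first_orders.customer
         AND orders.`order` = first_orders.first_order
        ORDER BY `order`""",
    # 3. INNER JOIN on customer, comparison moved into a CASE expression
    "option 3": """
        SELECT `order`,
               CASE WHEN first_order = `order` THEN first_order END AS first_order
          FROM orders
         INNER JOIN (SELECT customer, MIN(`order`) AS first_order
                       FROM orders GROUP BY customer) AS first_orders
            ON orders.customer = first_orders.customer
        ORDER BY `order`""",
    # 4. uncorrelated IN-subquery inside a CASE expression
    "option 4": """
        SELECT `order`,
               CASE WHEN `order` IN (SELECT MIN(`order`) FROM orders GROUP BY customer)
                    THEN `order` END AS first_order
          FROM orders
        ORDER BY `order`""",
    # 5. correlated NOT EXISTS subquery inside a CASE expression
    "option 5": """
        SELECT o1.`order`,
               CASE WHEN NOT EXISTS (SELECT 1 FROM orders AS o2
                                      WHERE o2.customer = o1.customer
                                        AND o2.`order` < o1.`order`)
                    THEN o1.`order` END AS first_order
          FROM orders AS o1
        ORDER BY o1.`order`""",
}


def run_all():
    """Run every variant against the sample data and return {name: rows}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (`order` INT, customer INT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", ROWS)
    return {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}


if __name__ == "__main__":
    results = run_all()
    baseline = results["option 1"]
    for name, rows in results.items():
        print(name, "matches option 1:", rows == baseline)
```

Note that option 1 relies on `order` values being unique across customers, which the sample data assumes.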

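Great answer, @ruakh. Option 3 is interesting, but in your example it would only return the first 100 orders. That is, if you have 100 customers and 2000 orders, it would return only the first 100 orders. Inspired by your suggestion, I tried something with UNION, which seems to work. - kristox
@kristox: The claim that "if you have 100 customers and 2000 orders, [option 3] would return only the first 100 orders" is not correct. Are you sure you copied the ON clause? - ruakh
@ruakh You're right, I was wrong. I tried option 3 and it works well. Thanks. - kristox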

1
I would expect this to be faster when using a variable instead of the LEFT JOIN:
SELECT
  `order`,
  If(@previous_customer<>(@previous_customer:=`customer`),
    `order`,
    NULL
  ) AS first_order
FROM orders
JOIN ( SELECT @previous_customer := -1 ) x
ORDER BY customer, `order`;

This is the result for my example on SQL Fiddle:
CUSTOMER    ORDER    FIRST_ORDER
1           1        1
1           2        (null)
1           3        (null)
2           4        4
2           5        (null)
3           6        6
4           7        7
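
The logic the variable trick depends on can be written out as a plain single-pass scan. The Python sketch below is an emulation under the assumption that rows are visited strictly in `customer, order` sequence and that the comparison happens before the assignment; the function name `flag_first_orders` is hypothetical. MySQL documents the evaluation order of expressions involving user variables as undefined, so the query itself is not guaranteed to behave like this loop.

```python
def flag_first_orders(rows):
    """Emulate the @previous_customer scan over (customer, order) tuples.

    Returns (customer, order, first_order) tuples, where first_order is the
    order number for a customer's first row and None otherwise.
    """
    previous_customer = -1  # mirrors SELECT @previous_customer := -1
    result = []
    for customer, order in sorted(rows):  # mirrors ORDER BY customer, `order`
        # Compare with the previous row's customer, then remember this one.
        first_order = order if customer != previous_customer else None
        previous_customer = customer
        result.append((customer, order, first_order))
    return result


if __name__ == "__main__":
    sample = [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (4, 7)]
    for row in flag_first_orders(sample):
        print(row)
```

With the seven rows from the result table as input, the loop flags orders 1, 4, 6 and 7, which matches the FIRST_ORDER column.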

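Section 9.4 of the MySQL Reference Manual advises against "assigning a value to a user variable and reading the value within the same statement", because there is no guarantee that it will always give the result you expect (after a MySQL version change, a change of execution plan, and so on). - ruakh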
