红移(Redshift) SQL 查询 - 优化

4

我有一个查询在Redshift上执行需要超过15分钟。这个查询是由AWS Lambda触发的,而Lambda的超时时间是15分钟。因此,我想检查是否有一种方法来优化查询以使其快速返回结果。

以下是我的SQL查询:

 insert into
  test.qa_locked
select
  '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481',
  'ABC-013505',
  'ABC-013505-2-2020',
  user_id,
  cast(TIMEOFDAY() as timestamp)
from
  (
    select
      user_id
    from
      (
                select
                  contact_id
                from
                  test.qa_locked
          )
        where
          contact_cnt <= 1
      )
  )

以下是计划:

XN Subquery Scan "*SELECT*" (cost=1000028198481.69..1000028198481.75 rows=1 width=218)
     ->  XN Subquery Scan derived_table1 (cost=1000028198481.69..1000028198481.73 rows=1 width=210)
         ->  XN Window (cost=1000028198481.69..1000028198481.71 rows=1 width=56)
             ->  XN Sort (cost=1000028198481.69..1000028198481.70 rows=1 width=56)
                 ->  XN Network (cost=1645148.05..28198481.68 rows=1 width=56)
                     ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=1645148.05..28198481.68 rows=1 width=56)
                         ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.76..28091814.71 rows=1 width=56)
                             ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645147.09..7491814.01 rows=1 width=56)
                                 ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.68..6805146.91 rows=1 width=56)
                                     ->  XN Hash NOT IN Join DS_DIST_INNER (cost=1645146.16..6438479.71 rows=1 width=56)
                                         ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.65..6071812.51 rows=1 width=56)
                                             ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645145.29..6071812.13 rows=1 width=56)
                                                 ->  XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.96..6071811.77 rows=1 width=56)
                                                     ->  XN Hash NOT IN Join DS_DIST_NONE (cost=1645144.50..5598477.96 rows=1 width=56)
                                                         ->  XN Hash NOT IN Join DS_DIST_BOTH (cost=1645144.47..5598477.91 rows=1 width=84)
                                                             ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=1645142.59..5078476.00 rows=1 width=84)
                                                                 ->  XN Hash NOT IN Join DS_BCAST_INNER (cost=1645142.57..4065142.63 rows=1 width=600)
                                                                     ->  XN Hash Left Join DS_DIST_BOTH (cost=1201145.21..3221145.24 rows=1 width=1116)
                                                                         ->  XN Seq Scan on contacts xa (cost=1201145.21..1201145.21 rows=1 width=640)
                                                                         ->  XN Hash (cost=0.00..0.00 rows=1 width=556)
                                                                             ->  XN Seq Scan on accounts ya (cost=0.00..0.00 rows=1 width=556)
                                                                     ->  XN Hash (cost=443997.35..443997.35 rows=1 width=32)
                                                                         ->  XN Subquery Scan "IN_subquery" (cost=23989.76..443997.35 rows=1 width=32)
                                                                             ->  XN Unique (cost=23989.76..443997.34 rows=1 width=516)
                                                                                 ->  XN Nested Loop DS_BCAST_INNER (cost=23989.76..443997.34 rows=1 width=516)
                                                                                     ->  XN Seq Scan on accounts con (cost=0.00..0.00 rows=1 width=516)
                                                                                     ->  XN Hash NOT IN Join DS_DIST_OUTER (cost=23989.76..83997.32 rows=1 width=26)
                                                                                         ->  XN Seq Scan on campaign_exclusion_list cam (cost=0.00..7.53 rows=1 width=26)
                                                                                         ->  XN Hash (cost=23989.75..23989.75 rows=1 width=32)
                                                                                             ->  XN Subquery Scan "IN_subquery" (cost=0.00..23989.75 rows=1 width=32)
                                                                                                 ->  XN Unique (cost=0.00..23989.74 rows=1 width=18)
                                                                                                     ->  XN Seq Scan on campaign_inclusion_list (cost=0.00..23989.74 rows=1 width=18)
                                                                 ->  XN Hash (cost=0.01..0.01 rows=1 width=516)
                                                                     ->  XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=516)
                                                                         ->  XN Unique (cost=0.00..0.00 rows=1 width=516)
                                                                             ->  XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=516)
                                                             ->  XN Hash (cost=1.88..1.88 rows=1 width=210)
                                                                 ->  XN Seq Scan on bh_email_open_clicks (cost=0.00..1.88 rows=1 width=210)
                                                         ->  XN Hash (cost=0.01..0.01 rows=1 width=210)
                                                             ->  XN Subquery Scan "IN_subquery" (cost=0.00..0.01 rows=1 width=210)
                                                                 ->  XN Unique (cost=0.00..0.00 rows=1 width=28)
                                                                     ->  XN Seq Scan on contacts (cost=0.00..0.00 rows=1 width=28)
                                                     ->  XN Hash (cost=0.45..0.45 rows=1 width=210)
                                                         ->  XN Seq Scan on bh_leads (cost=0.00..0.45 rows=1 width=210)
                                                 ->  XN Hash (cost=0.32..0.32 rows=1 width=402)
                                                     ->  XN Subquery Scan "IN_subquery" (cost=0.30..0.32 rows=1 width=402)
                                                         ->  XN HashAggregate (cost=0.30..0.31 rows=1 width=402)
                                                             ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.30 rows=1 width=402)
                                             ->  XN Hash (cost=0.35..0.35 rows=1 width=402)
                                                 ->  XN Subquery Scan "IN_subquery" (cost=0.33..0.35 rows=1 width=402)
                                                     ->  XN HashAggregate (cost=0.33..0.34 rows=1 width=402)
                                                         ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.33 rows=1 width=402)
                                         ->  XN Hash (cost=0.50..0.50 rows=1 width=210)
                                             ->  XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
                                     ->  XN Hash (cost=0.50..0.50 rows=1 width=210)
                                         ->  XN Seq Scan on bh_leads (cost=0.00..0.50 rows=1 width=210)
                                 ->  XN Hash (cost=0.40..0.40 rows=1 width=402)
                                     ->  XN Seq Scan on campaign_extraction_history (cost=0.00..0.40 rows=1 width=402)
                             ->  XN Hash (cost=0.30..0.30 rows=30 width=402)
                                 ->  XN Seq Scan on ce_locked_records_tb (cost=0.00..0.30 rows=30 width=402)
                         ->  XN Hash (cost=0.27..0.27 rows=1 width=210)
                             ->  XN Subquery Scan "IN_subquery" (cost=0.26..0.27 rows=1 width=210)
                                 ->  XN HashAggregate (cost=0.26..0.26 rows=1 width=210)
                                     ->  XN Seq Scan on bh_leads (cost=0.00..0.25 rows=1 width=210)

请建议是否有任何方法可以优化此查询。


1
这个查询是手写的还是由某个BI工具生成的?首先要注意的是它非常复杂,有所有那些子查询。最好能够摆脱它们。另一件事是它有26个“not in”运算符,这对效率非常不利。一个“not in”需要选择大量的数据,然后检查所需的值是否在每个返回的行中。这使得在任何数据库中都变得非常缓慢。它们似乎也负责大部分“cost”计算。该查询中还有31个子查询(SELECT)。 - John Rotenstein
@JohnRotenstein 该查询是使用存储过程生成的。 所有代码均手写。有没有其他有效的替代方式来避免使用not in? - dang
一个 NOT IN 经常可以被替换为一个 LEFT OUTER JOIN。然后,确认一个连接字段是否为空。关于这个问题在互联网上有很多讨论,例如:SQL performance on LEFT OUTER JOIN vs NOT EXISTSConsider using NOT EXISTS instead of NOT IN with a subquery - Redgate SoftwareNOT IN vs. NOT EXISTS vs. OUTER APPLY vs. OUTER JOIN - John Rotenstein
@JohnRotenstein 好的,对于 NOT IN 创建表是不可能的,因为该查询是基于请求执行的。此查询用于新作业的创建,我们必须使用表中最新的可用数据。您认为查询中是否有一些快速优化?例如,我们可以将 not in 更改为 not exist,这样会更好吗?或者还有其他小的调整吗? - dang
1
EXPLAIN计划显示了一个“cost”数字。你应该专注于减少或消除高成本。此外,高成本可能是由“DS_DIST_INNER”和“DS_DIST_BOTH”活动引起的。这些通常可以通过共享相同的“DISTKEY”表或在所有节点上复制表来避免。请参见:评估查询计划 - Amazon Redshift - John Rotenstein
显示剩余2条评论
1个回答

4
这似乎是一个不断添加查询的问题,有很多代码重复和许多不必要的表扫描。
请注意,我的主要经验是与MSSQL而不是redshift相关,但对于大多数情况都适用相同的原则。
 (
              lower(xa.primary_function) in (
                select
                  lower(param_val)
                from
                  ce_campaign_spec_tb
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'primary_function'
                  and relation_id = 4
              )
              and lower(xa.role) in (
                select
                  lower(param_val)
                from
                  ce_campaign_spec_tb
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'role'
                  and relation_id = 4
              )
              and lower(xa.title) in (
                select
                  lower(title)
                from
                  contacts con
                  inner join ce_campaign_spec_tb camp on lower(con.title) ilike '%' || trim(
                    both ' '
                    from
                      camp.param_val
                  ) || '%'
                where
                  job_id = '1d8db587-f5ab-41f4-9c2b-c4e21e0c7481'
                  and param = 'title'
                  and relation_id = 4
              )
            )

您好,这段代码重复了5次,只有relation_id不同。但是除了id之外似乎没有任何变化。可能存在细微差别,但现在您需要扫描5次表格而不是一次单个谓词的扫描。根据表格的大小,这可能是相当大量的数据。

再往下几行:

and xa.contact_id not in (
            select
              contact_id
            from
              bh_leads
            where
              (CURRENT_DATE - creation_date :: date) <= 60
              and UPPER(LOB) = 'ABC'
              and agency_id = '1002'
          )
          and xa.contact_id not in (
            select
              contact_id
            from
              bh_leads
            where
              (CURRENT_DATE - creation_date :: date) <= 60
              and UPPER(LOB) = 'ABC'
              and sponsor_id = '8306'
          )

再次对同一数据进行了两次表扫描,唯一的区别是一个检查赞助商ID是否有值,另一个检查代理ID。这可以在一条语句中完成而不是两条。

更进一步:

and email_id not in (
            select
              distinct email_id
            from
              contacts
            where
              is_email_suppressed = 1
          )

之前提到过联系人 (xa),并将其作为谓词放入 WHERE 子句中:

and xa.is_email_suppressed = 0

如果不知道所涉及的表的确切模式,我无法确定,但它们似乎做了大致相同的事情。

此外,在这里的Redshift文档中:https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

看起来您可以为单个会话创建临时表。大多数子查询都可以准备好,以便您可以加入结果集。例如,如果您首先准备一个针对具有有效结果的campaign_extraction_history表的临时结果集,则可以使用单个左连接替换以下谓词:

              AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              sf_oms_campaign_id = 'ABC-013505-2-2020'
              and sf_campaign_id = 'ABC-013505'
              and (CURRENT_DATE - creation_date :: date) < 1
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
          )
          AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              creation_date :: date = CURRENT_DATE
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
            group by
              contact_id
            having
              count(*) > 10
          )
          AND contact_id NOT IN (
            select
              contact_id
            from
              campaign_extraction_history
            where
              sf_campaign_id = 'ABC-013505'
              and channel = 'BOTH'
              and (
                UPPER(STATUS) = 'EXTRACTED'
                OR UPPER(STATUS) = 'LAUNCHED'
                OR UPPER(STATUS) = 'CONFIRMED'
              )
            group by
              contact_id
            having
              count(*) >= 3
          )

可能还有其他地方可以将查询合并,只需一次从表中获取数据。例如,在不同语句和子查询的各种位置上排除许多email_id值。它们很可能可以在一个语句中完成。

也许提高性能的最佳方法是问自己查询尝试做什么和排除什么,然后重写整个查询。这可能需要相当大的工作量,但从长远来看可能会更快。


对于您的第一个观点 - 我同意我重复了相同的代码块5次。但是因为我想为相同的relation_id放置AND条件。例如,relation_id = 1应该具有AND中的条件。然后我取下一个由OR分隔的条件。如何修改这个? - dang
用 'and relation_id between 1 and 5' 替换 'and relation_id = 4',或者用另一个查询的最小和最大值来替代1和5。如果有效的ID不总是以1递增,您还可以使用CTE或临时表来构建用于事务中的有效ID(或任何其他子选择)。 - Repr
我需要比较相同关系ID的primary_function、role和function的值。它不能在1到5之间。 - dang
如果您使用1到5之间的关系ID,您将获得5个结果。然后,您可以找到所有列匹配(主要功能、角色和功能)的单行。 - Repr

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接