如果IN子句中的表为空,Postgresql查询会变慢

3

我有以下SQL语句:

WITH filtered_users_pre as (
  SELECT value as username,row_number() OVER (partition by value) AS rk
    FROM "user-stats".tag_table
    WHERE _at_timestamp = 1626955200
       AND tag in ('commercial','marketing')
  ),

  filtered_users as (
    SELECT username
    FROM filtered_users_pre
    WHERE rk = 2
  ),

  valid_users as (
    SELECT aa.username, aa.rank, aa.points, aa.version
    FROM "users-results".ai_algo aa
    WHERE aa._at_timestamp = 1626955200
          AND aa.rank_timeframe = '7d'
          AND aa.username IN (SELECT * FROM filtered_users)
    ORDER BY aa.rank ASC
    LIMIT 15
    OFFSET 0
  )
select * from valid_users;

""user-stats".tag_table"是一张有大约6000万行的表,拥有适当的索引。 ""users-results".ai_algo"是一张有大约1000万行的表,拥有适当的索引。
适当的索引指的是在WHERE子句中出现的所有字段。
如果filtered_users为空,则查询运行需要4秒。如果filtered_users至少有一行,则需要400毫秒。
有人能解释为什么吗?是否有办法使查询在filtered_users为空时也保持同样的性能(400毫秒)?我希望随着filtered_users行数的减少获得更好的性能。这就是发生在1行以下的情况。当行数为0时,时间要多10倍。
当然,如果在WHERE子句中不使用IN,而改用ai_algofiltered_users之间的INNER JOIN,也会发生同样的情况。
更新: 当filtered_users没有行时(执行4秒),此处是查询的“EXPLAIN (ANALYZE, BUFFERS)”输出内容。"
Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
  Buffers: shared hit=7456641
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Buffers: shared hit=7456641
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa  (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
"              Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 7793096
              Buffers: shared hit=7456533
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
              Buffers: shared hit=108
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
                    Filter: (filtered_users_pre.rk = 2)
                    Rows Removed by Filter: 2415
                    Buffers: shared hit=108
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
                          Buffers: shared hit=108
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=108
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
"                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
"                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                            Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms

当 filtered_users 至少有1行时(300毫秒)

Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
  Buffers: shared hit=11042
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Rows Removed by Join Filter: 1544611
        Buffers: shared hit=11042
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
"              Index Cond: (rank_timeframe = '7d'::""valid-users-timeframe"")"
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 16124
              Buffers: shared hit=10937
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.003..0.174 rows=2395 loops=645)
              Buffers: shared hit=105
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.895..3.680 rows=2415 loops=1)
                    Filter: (filtered_users_pre.rk = 1)
                    Buffers: shared hit=105
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.894..3.334 rows=2415 loops=1)
                          Buffers: shared hit=105
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.889..2.102 rows=2415 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=105
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.604..1.046 rows=2415 loops=1)
"                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.287..0.287 rows=4830 loops=1)
"                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))"
                                            Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms

没有看到EXPLAIN(ANALYZE,BUFFERS)输出,没有人能回答这个问题。顺便说一句,对所有东西进行索引并不是正确的索引方式。 - Laurenz Albe
我还没有对所有内容进行索引,但我已经对搜索涉及的所有字段进行了索引,这就是我的意思。我很快会提供EXPLAIN。 - 70ny
根据@LaurenzAlbe的建议,以上答案已更新为EXPLAIN(ANALYZE,BUFFERS) - 70ny
1个回答

1
问题在于如果没有匹配的filtered_users,PostgreSQL必须在未找到15个结果行的情况下遍历所有"users-results".ai_algo。如果子查询包含元素,则可以快速找到15个匹配的"users-results".ai_algo行并终止处理。
你无法改变这一点,但可以加快"users-results".ai_algo的扫描速度。目前,你有
->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa
                              ... (actual time=0.085..3885.547 rows=313611 loops=1)
      Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
      Filter: (_at_timestamp = 1626955200)
      Rows Removed by Filter: 7793096
      Buffers: shared hit=7456533

你会发现索引扫描并不像它本应该的那样有效:它从表中读取了313611 + 7793096 = 8106707行,并且只保留符合过滤条件的313611行。
通过创建一个可以直接找到结果行的索引,你可以做得更好。
CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp);

然后,你可以删除索引ai_algo_rank_timeframe_rank_idx,因为新的索引能够完成旧索引的所有任务(甚至更多)。

谢谢,你建议的索引确实解决了问题(现在查询只需要350毫秒,即使没有返回任何行)。这样的答案对于理解逻辑非常有帮助! - 70ny

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接