PostgreSQL - Slow query when joining on a view

7

I want to do a simple join between a table (players) and a view (player_main_colors):

SELECT P.*, C.main_color FROM players P
    OUTER LEFT JOIN player_main_colors C USING (player_id)
    WHERE P.user_id=1;

This query takes about 40 ms.

Here I use a nested SELECT on the view instead of the JOIN:

SELECT player_id, main_color FROM player_main_colors
    WHERE player_id IN (
        SELECT player_id FROM players WHERE user_id=1);

This query also takes about 40 ms.

When I split the query into two separate queries, it becomes as fast as I expected:

SELECT player_id FROM players WHERE user_id=1;

SELECT player_id, main_color FROM player_main_colors
    where player_id in (584, 9337, 11669, 12096, 13651,
        13852, 9575, 23388, 14339, 500, 24963, 25630,
        8974, 13048, 11904, 10537, 20362, 9216, 4747, 25045);

Each of these queries takes about 0.5 ms.

So why are the queries above with the JOIN or the subquery so slow, and how can I fix this?

Here are some details about my tables and the view:

CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    ...
)

CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users (user_id),
    ...
)

CREATE TABLE player_data (
    player_id INTEGER NOT NULL REFERENCES players (player_id),
    game_id INTEGER NOT NULL,
    color INTEGER NOT NULL,
    PRIMARY KEY (player_id, game_id, color),
    active_time INTEGER DEFAULT 0,
    ...
)

CREATE VIEW player_main_colors AS
    SELECT DISTINCT ON (1) player_id, color as main_color
        FROM player_data
        GROUP BY player_id, color
        ORDER BY 1, MAX(active_time) DESC

It looks like my problem may be related to the view...?

Here is the EXPLAIN ANALYZE output for the nested SELECT query above:

Merge Semi Join  (cost=1877.59..2118.00 rows=6851 width=8) (actual time=32.946..38.471 rows=25 loops=1)
   Merge Cond: (player_data.player_id = players.player_id)
   ->  Unique  (cost=1733.19..1801.70 rows=13701 width=12) (actual time=32.651..37.209 rows=13419 loops=1)
         ->  Sort  (cost=1733.19..1767.45 rows=13701 width=12) (actual time=32.646..34.918 rows=16989 loops=1)
               Sort Key: player_data.player_id, (max(player_data.active_time))
               Sort Method: external merge  Disk: 376kB
               ->  HashAggregate  (cost=654.79..791.80 rows=13701 width=12) (actual time=13.636..19.051 rows=17311 loops=1)
                     ->  Seq Scan on player_data  (cost=0.00..513.45 rows=18845 width=12) (actual time=0.005..1.801 rows=18845 loops=1)
   ->  Sort  (cost=144.40..144.53 rows=54 width=8) (actual time=0.226..0.230 rows=54 loops=1)
         Sort Key: players.player_id
         Sort Method: quicksort  Memory: 19kB
         ->  Bitmap Heap Scan on players  (cost=4.67..142.85 rows=54 width=8) (actual time=0.035..0.112 rows=54 loops=1)
               Recheck Cond: (user_id = 1)
               ->  Bitmap Index Scan on test  (cost=0.00..4.66 rows=54 width=0) (actual time=0.023..0.023 rows=54 loops=1)
                     Index Cond: (user_id = 1)
 Total runtime: 39.279 ms

As for indexes, apart from the default indexes on my primary keys, I have only one relevant index:

CREATE INDEX player_user_idx ON players (user_id);

I am currently using PostgreSQL 9.2.9.

Update:

I have simplified the problem. Compare IN (4747) with IN (SELECT 4747).

Slow:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (SELECT 4747);

 Hash Join  (cost=1749.99..1975.37 rows=6914 width=8) (actual time=30.492..34.291 rows=4 loops=1)
   Hash Cond: (player_data.player_id = (4747))
   ->  Sort  (cost=1749.95..1784.51 rows=13827 width=12) (actual time=30.391..32.655 rows=17464 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: external merge  Disk: 376kB
         ->  HashAggregate  (cost=660.71..798.98 rows=13827 width=12) (actual time=12.714..17.249 rows=17464 loops=1)
               ->  Seq Scan on player_data  (cost=0.00..518.12 rows=19012 width=12) (actual time=0.006..1.898 rows=19012 loops=1)
   ->  Hash  (cost=0.03..0.03 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
               ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 35.015 ms
(12 rows)

Time: 35.617 ms

Fast:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (4747);

 Subquery Scan on s  (cost=17.40..17.45 rows=4 width=8) (actual time=0.035..0.035 rows=4 loops=1)
   ->  Sort  (cost=17.40..17.41 rows=4 width=12) (actual time=0.034..0.034 rows=4 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: quicksort  Memory: 17kB
         ->  GroupAggregate  (cost=0.00..17.36 rows=4 width=12) (actual time=0.020..0.027 rows=4 loops=1)
               ->  Index Scan using player_data_pkey on player_data  (cost=0.00..17.28 rows=5 width=12) (actual time=0.014..0.019 rows=5 loops=1)
                     Index Cond: (player_id = 4747)
 Total runtime: 0.080 ms
(8 rows)

Time: 0.610 ms

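Did you try an exists query? ... FROM player_main_colors p1 WHERE exists (SELECT 1 FROM players p2 where p2.player_id = p1.player_id and p2.user_id=1) - user330315
I had not tried that before, but it also seems to take 40 ms. - user202987
Have you analyzed all the relevant tables recently? - Mark Roberts
What does "explain analyze" show for the "optimal" 0.5 ms case? I have seen cases where the optimizer can build a better plan from explicit parameters than from "implied" ones (all players where user_id = 1). - Mark Roberts
Sort Method: external merge  Disk: 376kB. Could you show us your configuration? Especially work_mem: it looks like the setting is very low, too low to do the sort in memory. - Frank Heikens
2 Answers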

9

Your VIEW definition uses both GROUP BY and DISTINCT ON, which is like shooting a dead man. To simplify, consider:

CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (player_id)
       player_id, color AS main_color
FROM   player_data
ORDER  BY player_id, active_time DESC NULLS LAST;

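NULLS LAST is required to stay equivalent to your original, because active_time can be NULL according to your table definition. This should be faster already. But there is more. For best performance, create these indexes: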

CREATE INDEX players_up_idx ON players (user_id, player_id);
CREATE INDEX players_data_pa_idx ON player_data
    (player_id, active_time DESC NULLS LAST, color);

Use DESC NULLS LAST in the index as well, so that it matches the sort order of the query. Alternatively, declare the column player_data.active_time as NOT NULL and simplify everything.
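The NOT NULL alternative could be sketched roughly as follows. This is only an illustration under assumptions: the backfill value 0 is taken from the column default in the question, it is assumed that treating missing values as 0 is acceptable for the application, and it is assumed that the index players_data_pa_idx from above has already been created and should be replaced.

```sql
-- Assumption: NULL active_time can safely be treated as 0 (the column default).
UPDATE player_data SET active_time = 0 WHERE active_time IS NULL;
ALTER TABLE player_data ALTER COLUMN active_time SET NOT NULL;

-- With NOT NULL in place, NULLS LAST is no longer needed in the view ...
CREATE OR REPLACE VIEW player_main_colors AS
SELECT DISTINCT ON (player_id)
       player_id, color AS main_color
FROM   player_data
ORDER  BY player_id, active_time DESC;

-- ... nor in the matching index (replaces the NULLS LAST variant).
DROP INDEX IF EXISTS players_data_pa_idx;
CREATE INDEX players_data_pa_idx ON player_data
    (player_id, active_time DESC, color);
```

Whether this is worth doing depends on whether NULL carries any meaning in active_time for the application.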

It has to be LEFT OUTER JOIN, not OUTER LEFT JOIN. Or just omit the noise word OUTER:

SELECT *  -- equivalent here to "p.*, c.main_color"
FROM   players p
LEFT   JOIN player_main_colors c USING (player_id)
WHERE  p.user_id = 1;

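I assume there are many rows per player_id in player_data, and that you only select a few player_id values. For this case, JOIN LATERAL would be the fastest approach, but it requires Postgres 9.3 or later. In pg 9.2 you can achieve a similar effect with a correlated subquery: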

CREATE VIEW player_main_colors AS
SELECT player_id
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
ORDER  BY 1;  -- optional

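There is one subtle difference from the original view: this one includes players that have no entries in player_data. You can try the same query as above based on the new view. But I would not use a view at all. This is probably fastest: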

SELECT *
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
WHERE  p.user_id = 1;
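For readers on Postgres 9.3 or later, the JOIN LATERAL variant mentioned earlier might look like the sketch below. It is not part of the original answer's code; it reuses the same tables and columns, and the alias c is only illustrative.

```sql
-- Requires Postgres 9.3+; not available on the asker's 9.2.9.
SELECT p.*, c.main_color
FROM   players p
LEFT   JOIN LATERAL (
   SELECT color AS main_color
   FROM   player_data
   WHERE  player_id = p.player_id
   ORDER  BY active_time DESC NULLS LAST
   LIMIT  1
   ) c ON true   -- LEFT JOIN ... ON true keeps players without player_data rows
WHERE  p.user_id = 1;
```

Like the correlated subquery, this runs the inner query once per selected player, so it should benefit from the same index on (player_id, active_time DESC NULLS LAST, color).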


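Thank you very much for the valuable information. In my environment: VIEW without correlated subquery: 20 ms; VIEW with correlated subquery: 90 ms; direct query: 1 ms. I have decided to maintain a main_color column in the players table for now, since it is practical and will reduce the complexity of several queries. - user202987
@user202987: Maintaining a redundant column has various costs, too. It makes writes more expensive, makes the table bigger, and introduces an additional index, which lowers the benefit of every index. It cuts both ways. Given the overwhelming performance of the direct query, I would use that. - Erwin Brandstetter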

0

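So, the reason for this behaviour is a limitation of the query planner. With specific bound parameters, the planner can build a specific plan based on the query it sees and analyzes. When joins and subqueries are involved, however, it has far less visibility into what is going to happen. This makes the optimizer fall back to a more "generic" plan, which in this case is significantly slower.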

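The right answer for you appears to be running two selects. Perhaps a better answer would be to denormalize "main_color" into your players table and update it periodically.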
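A minimal sketch of that denormalization, assuming a nullable integer column is acceptable and that a periodic batch refresh (rather than a trigger) is good enough; the column name main_color follows the view in the question:

```sql
-- One-time schema change (assumption: NULL means "no player_data rows yet").
ALTER TABLE players ADD COLUMN main_color integer;

-- Periodic refresh: recompute the main color for every player.
UPDATE players p
SET    main_color = (
   SELECT color
   FROM   player_data
   WHERE  player_id = p.player_id
   ORDER  BY active_time DESC NULLS LAST
   LIMIT  1);
```

Between refreshes the column can be stale, and every refresh rewrites all rows in players, so this trades write cost for read speed.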


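Two queries are an inferior solution. A single call to the database is typically faster. With improved queries and indexes, denormalization is most likely unnecessary. Finally, joins and subqueries are not a problem for the query planner at all. Prepared statements have to be planned for any possible parameter value, which can force a more generic query plan. - Erwin Brandstetter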

Content provided by Stack Overflow.