当按照来自两个不同表的列进行排序时，PostgreSQL查询速度变慢很多。

Question

当按照来自两个不同表的列进行排序时，PostgreSQL查询速度变慢很多。

5

以前我使用的这个查询非常快：

cb=# explain analyze SELECT "web_route"."id", "web_crag"."id" FROM "web_route" 
INNER JOIN "web_crag" ON ( "web_route"."crag_id" = "web_crag"."id" )
WHERE "web_crag"."type" IN (1, 2) 
ORDER BY "web_crag"."name" ASC
LIMIT 20;
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..2.16 rows=20 width=18) (actual time=0.027..0.105 rows=20 loops=1)
   ->  Nested Loop  (cost=0.00..47088.94 rows=436055 width=18) (actual time=0.026..0.100 rows=20 loops=1)
         ->  Index Scan using web_crag_name on web_crag  (cost=0.00..503.16 rows=1776 width=14) (actual time=0.011..0.020 rows=14 loops=1)
               Filter: (type = ANY ('{1,2}'::integer[]))
         ->  Index Scan using web_route_crag_id on web_route  (cost=0.00..23.27 rows=296 width=8) (actual time=0.004..0.005 rows=1 loops=14)
               Index Cond: (crag_id = web_crag.id)
 Total runtime: 0.154 ms
(7 rows)

查询的问题在于返回的行的顺序是不确定的，这导致了OFFSET（即分页）在我的Web应用程序中产生了重复的行。为了解决这个问题，我决定通过额外按“web_route” .id进行排序来使排序变得严格。

cb=# explain analyze SELECT "web_route"."id", "web_crag"."id" FROM "web_route" 
INNER JOIN "web_crag" ON ( "web_route"."crag_id" = "web_crag"."id" )
WHERE "web_crag"."type" IN (1, 2)
ORDER BY "web_crag"."name", "web_route"."id" ASC 
LIMIT 20;
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=29189.04..29189.09 rows=20 width=18) (actual time=324.065..324.068 rows=20 loops=1)
   ->  Sort  (cost=29189.04..30279.18 rows=436055 width=18) (actual time=324.063..324.064 rows=20 loops=1)
         Sort Key: web_crag.name, web_route.id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Hash Join  (cost=135.40..17585.78 rows=436055 width=18) (actual time=0.882..195.941 rows=435952 loops=1)
               Hash Cond: (web_route.crag_id = web_crag.id)
               ->  Seq Scan on web_route  (cost=0.00..10909.55 rows=436055 width=8) (actual time=0.026..55.916 rows=435952 loops=1)
               ->  Hash  (cost=113.20..113.20 rows=1776 width=14) (actual time=0.848..0.848 rows=1775 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 82kB
                     ->  Seq Scan on web_crag  (cost=0.00..113.20 rows=1776 width=14) (actual time=0.004..0.510 rows=1775 loops=1)
                           Filter: (type = ANY ('{1,2}'::integer[]))
 Total runtime: 324.101 ms
(12 rows)

然而，正如您所看到的，查询速度变慢了2000倍，这相当严重 :). 如果有什么方法可以解决这个问题，我想知道。我的计划是进行一次不好的黑客攻击，并将“web_crag”中的“name”复制到“web_route”中，以便我可以在这两列（crag_name，id）上放置索引，但如果有更好的方法，我会很高兴。

如果有关系，请参考“web_route”和“web_crag”的图表。

cb=# \d web_crag;
                                      Table "public.web_crag"
     Column      |           Type           |                       Modifiers                       
-----------------+--------------------------+-------------------------------------------------------
 id              | integer                  | not null default nextval('web_crag_id_seq'::regclass)
 name            | character varying(64)    | not null
 latitude        | double precision         | 
 longitude       | double precision         | 
 type            | integer                  | 
 description     | text                     | not null
 normalized_name | character varying(64)    | not null
 country_id      | integer                  | 
 location_index  | character(24)            | not null
 added_by_id     | integer                  | 
 date_created    | timestamp with time zone | 
 last_modified   | timestamp with time zone | 
Indexes:
    "web_crag_pkey" PRIMARY KEY, btree (id)
    "web_crag_added_by_id" btree (added_by_id)
    "web_crag_country_id" btree (country_id)
    "web_crag_location_index" btree (location_index)
    "web_crag_name" btree (name)
Foreign-key constraints:
    "added_by_id_refs_id_1745ebe43b31bec6" FOREIGN KEY (added_by_id) REFERENCES web_member(id) DEFERRABLE INITIALLY DEFERRED
    "country_id_refs_id_1384050a9bd763af" FOREIGN KEY (country_id) REFERENCES web_country(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "web_route" CONSTRAINT "crag_id_refs_id_3ce1145606d12740" FOREIGN KEY (crag_id) REFERENCES web_crag(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "web_video" CONSTRAINT "crag_id_refs_id_4fc9cbf2832725ca" FOREIGN KEY (crag_id) REFERENCES web_crag(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "web_image" CONSTRAINT "crag_id_refs_id_58210dd331468848" FOREIGN KEY (crag_id) REFERENCES web_crag(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "web_eventdestination" CONSTRAINT "crag_id_refs_id_612ad57c4d76c32c" FOREIGN KEY (crag_id) REFERENCES web_crag(id) DEFERRABLE INITIALLY DEFERRED
Triggers:
    set_crag_location_index BEFORE INSERT OR UPDATE ON web_crag FOR EACH ROW EXECUTE PROCEDURE set_crag_location_index()

cb=# \d web_route
                                        Table "public.web_route"
       Column       |           Type           |                       Modifiers                        
--------------------+--------------------------+--------------------------------------------------------
 id                 | integer                  | not null default nextval('web_route_id_seq'::regclass)
 name               | character varying(64)    | not null
 crag_id            | integer                  | not null
 sector             | character varying(64)    | not null
 difficulty         | character varying(16)    | not null
 author             | character varying(64)    | not null
 build_date         | character varying(32)    | not null
 description        | text                     | not null
 difficulty_numeric | integer                  | 
 length_meters      | double precision         | 
 added_by_id        | integer                  | 
 date_created       | timestamp with time zone | 
 last_modified      | timestamp with time zone | 
 normalized_name    | character varying(64)    | not null
 rating_votes       | integer                  | not null
 rating_score       | integer                  | not null
Indexes:
    "web_route_pkey" PRIMARY KEY, btree (id)
    "web_route_added_by_id" btree (added_by_id)
    "web_route_crag_id" btree (crag_id)
Check constraints:
    "ck_rating_votes_pstv_c39bae29f3b2012" CHECK (rating_votes >= 0)
    "web_route_rating_votes_check" CHECK (rating_votes >= 0)
Foreign-key constraints:
    "added_by_id_refs_id_157791930f5e12d5" FOREIGN KEY (added_by_id) REFERENCES web_member(id) DEFERRABLE INITIALLY DEFERRED
    "crag_id_refs_id_3ce1145606d12740" FOREIGN KEY (crag_id) REFERENCES web_crag(id) DEFERRABLE INITIALLY DEFERRED

- clime

你的数据历史是什么？你是截断还是从转储中导入的？如果是这样，请尝试更新规划器统计信息或重新创建索引。由于过时的假设，PostgreSQL可能会制定一个糟糕的计划。此外，当你想要分页时，应该在排序条件中包含所有主键，以避免结果排序的任何自由。 - Augustus Kling

@AugustusKling: 我已经运行了 vacuum analyze ，对表 web_route 进行了重新索引，对表 web_crag 进行了重新索引。但是这并没有改变任何东西。我也不完全记得数据来自哪里，它是从csv导入的还是从sql转储中导入的。 - clime

我希望能够更好地估算查询计划，但看起来统计数据已经很好了。你的数据可能包含许多 web_crag 中的 web_routes，并且必须按 web_route.id 进行排序，这可能需要对许多 web_routes 进行排序（或者至少规划者会为此做准备）。我建议尝试 create index on web_route (crag_id asc, id asc) 来收集关于列分布的信息，然后使用 order by web_crag.name, web_route.crag_id, web_route.id，并希望它足够满足需求。 - Augustus Kling

@AugustusKling：我尝试过了，但实际上并没有看到与原始查询相比的任何改进。 - clime

web_crag.name是否真的是唯一的？如果不是，将其包含在其他表中意味着什么？由于web_route（crag_id asc，id asc）上有索引，应该有更好的计划可用，但似乎规划程序无法找到它们。 - jjanes

显示剩余5条评论

2个回答

2

这里的问题在于，您之前可以使用现有索引进行排序。

 ->  Index Scan using web_crag_name on web_crag  (cost=0.00..503.16 rows=1776 width=14)      (actual time=0.011..0.020 rows=14 loops=1)
           Filter: (type = ANY ('{1,2}'::integer[]))

但是，在为排序添加两个不同的列后，需要使用第一张表中的一个索引和第二张表中的另一个索引。如果您想要在分页时使其运行快速，唯一合理的事情就是避免第三范式（即不重复数据），只需在另一张表中复制数据。它可以是 web.crag 表中的 web.route_id 或反之亦然（没有深入研究您的架构），并创建一个组合索引 CREATE INDEX ON table ("web_crag"."name", "web_route"."id"); 当表上存在组合索引且列顺序正确时，扫描时间将与第一个查询一样好。

还要确保您检查 intgr-s 的解决方案。它适用于 9.3+，并能完成任务。材料化与（有点难以阅读的）查询是您的选择。我个人更喜欢材料化，但 intgr 让我学到了新东西。

- kristok

当然，复制数据意味着现在每次更新“web_crag”名称时，您还必须更新所有相关的“web_route”行中的相应值。您可能还会遇到并发问题，例如当一个事务尝试更新名称而另一个插入新路由行时。 - intgr

复制数据总是一种必要的恶，我同意。 - kristok

始终是必要的恶吗？不，我的解决方案不需要复制数据。 - intgr

我的意思是复制数据总是不好的，但当你需要这样做并且它能简化事情时，这就是必要的恶。说实话，我从未使用过Lateral，但我很想看看它产生的解释计划是什么。 - kristok

我更新了我的答案，并附上了在运行一些测试数据时的查询计划。 - intgr

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- intgr · Accepted Answer

很遗憾，PostgreSQL目前还不擅长优化这些类型的排序，如果找不到与排序子句完全匹配的索引，它总是想一次性对整个结果集进行排序。

从PostgreSQL 9.3开始，您可以通过LATERAL subquery来欺骗规划器以执行正确的操作。请尝试此操作：

SELECT "web_route"."id", "web_crag"."id"
FROM "web_crag",
LATERAL (
    SELECT * FROM "web_route"
    WHERE "web_route"."crag_id" = "web_crag"."id"
    ORDER BY "web_route"."id" ASC
) AS "web_route"
WHERE "web_crag"."type" IN (1, 2)
ORDER BY "web_crag"."name"
LIMIT 20;

通过我生成的一些简单测试数据（100万个web_crags，500万个web_routes），这是查询计划和时间... 与您的第一个查询计划几乎相同，除了为web_route.id多出一个排序：

 Limit  (cost=24.36..120.70 rows=20 width=14) (actual time=0.051..0.169 rows=20 loops=1)
   ->  Nested Loop  (cost=24.36..24084788.95 rows=5000000 width=14) (actual time=0.049..0.143 rows=20 loops=1)
         ->  Index Scan using web_crag_name_idx on web_crag  (cost=0.42..39131.46 rows=1000000 width=10) (actual time=0.018..0.023 rows=4 loops=1)
               Filter: (type = ANY ('{1,2}'::integer[]))
         ->  Sort  (cost=23.93..23.95 rows=5 width=8) (actual time=0.018..0.021 rows=5 loops=4)
               Sort Key: web_route.id
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using web_route_crag_id_idx on web_route  (cost=0.43..23.88 rows=5 width=8) (actual time=0.005..0.011 rows=5 loops=4)
                     Index Cond: (crag_id = web_crag.id)
 Total runtime: 0.212 ms

您可以通过在web_route（crag_id，id）上添加额外的索引来避免排序：

 Limit  (cost=0.86..19.49 rows=20 width=14) (actual time=0.031..0.113 rows=20 loops=1)
   ->  Nested Loop  (cost=0.86..4659293.82 rows=5000000 width=14) (actual time=0.029..0.084 rows=20 loops=1)
         ->  Index Scan using web_crag_name_idx on web_crag  (cost=0.42..39293.82 rows=1000000 width=10) (actual time=0.017..0.021 rows=4 loops=1)
               Filter: (type = ANY ('{1,2}'::integer[]))
         ->  Index Only Scan using web_route_crag_id_id_idx on web_route  (cost=0.43..4.52 rows=5 width=8) (actual time=0.005..0.009 rows=5 loops=4)
               Index Cond: (crag_id = web_crag.id)
               Heap Fetches: 0
 Total runtime: 0.151 ms

这是我创建测试数据的方法：

create table web_crag(id serial primary key, type int default 1, name text);
create table web_route(id serial primary key, crag_id int);
insert into web_crag (name) select generate_series(1,1000000)::text;
insert into web_route (crag_id) select id from web_crag cross join generate_series(1,5);
create index on web_crag(name);
create index on web_route(crag_id);
analyze web_route;

PostgreSQL补丁

有一个"部分排序"的补丁可以自动进行此类优化，但遗憾的是它未能进入PostgreSQL 9.4。希望PostgreSQL 9.5会有它（大约在2015年下半年）。