PostgreSQL中的字典排序非常缓慢?

6

我有一个名为vote_pairs的视图,它看起来像这样:

CREATE VIEW vote_pairs AS
    SELECT
        v1.name as name1,
        v2.name as name2,
        ...
    FROM votes AS v1
    JOIN votes AS v2
        ON v1.topic_id = v2.topic_id;

当使用视图进行查询时,votes表中有大约100000行数据,这些查询需要约3秒钟才能执行。

但是,当我在名称上添加一个额外的过滤器时:

ON v1.topic_id = v2.topic_id AND v1.name < v2.name;

运行时间增加了四倍,针对vote_pairs的查询需要近12秒才能完成。

无论限制条件的位置如何,运行时间都是一致的……例如,如果将过滤器移动到外部查询的WHERE子句中,查询同样会很慢:

SELECT * FROM vote_pairs WHERE name1 < name2;

发生了什么?在Postgres中进行词典排序比较慢吗?还是其他原因导致的?我该如何提高此查询的速度?

投票表:

CREATE TABLE votes (
    topic_id INTEGER REFERENCES topics(id),
    name VARCHAR(64),
    vote VARCHAR(12)
)

CREATE INDEX votes_topic_name ON votes (topic_id, name);
CREATE INDEX votes_name ON votes (name);
EXPLAIN ANALYZE 没有名称过滤器的输出结果为:
db=# CREATE OR REPLACE VIEW vote_pairs AS
db-#     SELECT
db-#         v1.name as name1,
db-#         v2.name as name2
db-#     FROM votes AS v1
db-#     JOIN votes AS v2
db-#         ON v1.topic_id = v2.topic_id;
CREATE VIEW
db=# EXPLAIN ANALYZE SELECT * FROM vote_pairs;                                                                                                                                                                                                                           QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3956.38..71868.56 rows=5147800 width=28) (actual time=51.810..1236.673 rows=5082750 loops=1)
   Hash Cond: (v1.topic_id = v2.topic_id)
   ->  Seq Scan on votes v1  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.019..18.358 rows=112950 loops=1)
   ->  Hash  (cost=1882.50..1882.50 rows=112950 width=18) (actual time=50.671..50.671 rows=112950 loops=1)
         ->  Seq Scan on votes v2  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.004..20.306 rows=112950 loops=1)
 Total runtime: 1495.963 ms
(6 rows)

使用过滤器:

db=# CREATE OR REPLACE VIEW vote_pairs AS
db-#     SELECT
db-#         v1.name as name1,
db-#         v2.name as name2
db-#     FROM votes AS v1
db-#     JOIN votes AS v2
db-#         ON v1.topic_id = v2.topic_id AND v1.name < v2.name;
CREATE VIEW
db=# EXPLAIN ANALYZE SELECT * FROM vote_pairs;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3956.38..84738.06 rows=1715933 width=28) (actual time=66.688..6900.478 rows=2484900 loops=1)
   Hash Cond: (v1.topic_id = v2.topic_id)
   Join Filter: ((v1.name)::text < (v2.name)::text)
   ->  Seq Scan on votes v1  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.023..24.539 rows=112950 loops=1)
   ->  Hash  (cost=1882.50..1882.50 rows=112950 width=18) (actual time=65.603..65.603 rows=112950 loops=1)
         ->  Seq Scan on votes v2  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.004..26.756 rows=112950 loops=1)
 Total runtime: 7048.740 ms
(7 rows)

解释(分析,缓冲):

db=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vote_pairs;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3956.38..71345.89 rows=5152008 width=28) (actual time=56.230..1204.522 rows=5082750 loops=1)
   Hash Cond: (v1.topic_id = v2.topic_id)
   Buffers: shared hit=129 read=1377 written=2, temp read=988 written=974
   ->  Seq Scan on votes v1  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.008..20.492 rows=112950 loops=1)
         Buffers: shared hit=77 read=676
   ->  Hash  (cost=1882.50..1882.50 rows=112950 width=18) (actual time=55.742..55.742 rows=112950 loops=1)
         Buckets: 2048  Batches: 8  Memory Usage: 752kB
         Buffers: shared hit=52 read=701 written=2, temp written=480
         ->  Seq Scan on votes v2  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.004..22.954 rows=112950 loops=1)
               Buffers: shared hit=52 read=701 written=2
 Total runtime: 1499.302 ms
(11 rows)


db=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM vote_pairs WHERE name1 > name2;                                              
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3956.38..84225.91 rows=1717336 width=28) (actual time=51.214..6422.592 rows=2484900 loops=1)
   Hash Cond: (v1.topic_id = v2.topic_id)
   Join Filter: ((v1.name)::text > (v2.name)::text)
   Rows Removed by Join Filter: 2597850
   Buffers: shared hit=32 read=1477, temp read=988 written=974
   ->  Seq Scan on votes v1  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.008..22.605 rows=112950 loops=1)
         Buffers: shared hit=27 read=726
   ->  Hash  (cost=1882.50..1882.50 rows=112950 width=18) (actual time=50.678..50.678 rows=112950 loops=1)
         Buckets: 2048  Batches: 8  Memory Usage: 752kB
         Buffers: shared hit=2 read=751, temp written=480
         ->  Seq Scan on votes v2  (cost=0.00..1882.50 rows=112950 width=18) (actual time=0.005..21.337 rows=112950 loops=1)
               Buffers: shared hit=2 read=751
 Total runtime: 6573.308 ms
(13 rows)

注意事项:

  • VACCUM FULLANALYZE votes 命令已被执行
  • 8.4.11 和 9.2.3 版本的表现方式相同

1
你能提供该表的模式以及 EXPLAIN ANALYZE <query> 的结果吗? - Alex Gaynor
1
我不禁想到,你的基础表结构是问题的根源。首先,它没有键。其次,没有明确的理由为什么要在topic_id上将其自身连接起来。似乎结果只会是名称或投票的笛卡尔积。真正的问题是什么 - Mike Sherrill 'Cat Recall'
@wildplasser 是的,我可以 CREATE TABLE persons (id SERIAL PRIMARY KEY, name VARCHAR(64))… 但这会使我的用例变得复杂(出于各种好的原因)。 - David Wolever
不,那只会让你余生更加舒适。而且总有VIEW可以帮助你的。 - wildplasser
1
值得一看:http://stackoverflow.com/tags/postgresql-performance/info - Craig Ringer
显示剩余7条评论
2个回答

5

是的,文本比较有时很慢。你可以尝试以下方法:

SELECT * FROM vote_pairs WHERE name1 > name2 collate "C";

这应该会更快,因为它不考虑特定于语言环境的比较规则。另外,你的解析结果表明,你的 shared_buffers 可能设置得太低了。


哇!这个做法真的有效——添加collate可以提高查询速度,与没有比较条件的查询大致相同。谢谢! - David Wolever
2
如果你要这样做,最好在“name”列上添加一个“CHECK”约束,将其限制为7位ASCII字符范围,这是(几乎 - 噫,Shift-JIS)编码常见的范围。如果应用程序不能接受这种限制,那么“C”排序规则可能也不行。 - Craig Ringer
谢谢您的建议。不过,在这种情况下,这不是一个完整的应用程序...只是一些一次性的数据分析。但我会记住这个建议,以备将来之需。 - David Wolever

1
我猜测慢的原因是因为在交叉连接的每一行中,v1.name < v2.name过滤器会增加一些固定的操作。

一个更有效率的操作是检查v1.name <> v2.name,但这样你会得到重复的结果,例如(A,B), (B,A)。然后,我们可以将v1.name < v2.name添加回一个WHERE子句中,这将删除重复项,并希望由于我们简化的过滤器而命中较少的行。

试试这个:

CREATE OR REPLACE VIEW vote_pairs AS
    SELECT
        v1.name as name1,
        v2.name as name2
    FROM votes AS v1
    JOIN votes AS v2
        ON v1.topic_id = v2.topic_id AND v1.name <> v2.name
    WHERE v1.name < v2.name;

(编辑:似乎COLLATE“C”是正确的方法,但我将保留此答案,因为它是减少行暴露于缓慢操作的不错技巧。)

好的!这确实有所帮助——它将运行时间从约7秒减少到约4秒……但仍然比没有比较慢得多。 - David Wolever

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接