PostgreSQL的array_agg函数非常缓慢

3

我有两个表 callscalls_statistics。在calls表中,主键是 calls_id,它也是calls_statistics 表中的外键。

calls表目前包含16k条记录。

当我运行以下命令时:

SELECT c.*,
array_agg(cs.mean) AS statistics_means
FROM calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id
order by caller_id ASC, call_time ASC LIMIT 100;

查询大约需要 622 毫秒。

 Limit  (cost=11947.99..11948.24 rows=100 width=551) (actual time=518.921..518.941 rows=100 loops=1)
   ->  Sort  (cost=11947.99..11989.07 rows=16429 width=551) (actual time=518.918..518.928 rows=100 loops=1)
         Sort Key: c.caller_id, c.call_time
         Sort Method: top-N heapsort  Memory: 126kB
         ->  HashAggregate  (cost=11114.73..11320.09 rows=16429 width=551) (actual time=461.869..494.761 rows=16429 loops=1)
               ->  Hash Right Join  (cost=6234.65..10705.12 rows=81922 width=551) (actual time=79.171..257.498 rows=81922 loops=1)
                     Hash Cond: (cs.calls_id = c.calls_id)
                     ->  Seq Scan on calls_statistics cs  (cost=0.00..2627.22 rows=81922 width=12) (actual time=3.534..26.778 rows=81922 loops=1)
                     ->  Hash  (cost=6029.29..6029.29 rows=16429 width=547) (actual time=75.578..75.578 rows=16429 loops=1)
                           Buckets: 2048  Batches: 1  Memory Usage: 9370kB
                           ->  Seq Scan on calls c  (cost=0.00..6029.29 rows=16429 width=547) (actual time=13.806..42.446 rows=16429 loops=1)
 Total runtime: 622.537 ms

然而,当我禁用array_agg并运行查询时,它使用我的索引:
SELECT c.*,
cs.mean
FROM calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
order by caller_id ASC, call_time ASC LIMIT 100;

这个查询只需要0.565毫秒!

 Limit  (cost=0.70..52.93 rows=100 width=551) (actual time=0.077..0.320 rows=100 loops=1)
   ->  Nested Loop Left Join  (cost=0.70..42784.95 rows=81922 width=551) (actual time=0.075..0.304 rows=100 loops=1)
         ->  Index Scan using calls_caller_id_call_time_calls_id_idx on calls c  (cost=0.29..22395.06 rows=16429 width=547) (actual time=0.042..0.091 rows=25 loops=1)
         ->  Index Scan using calls_stats_calls_idx on calls_statistics cs  (cost=0.42..1.18 rows=6 width=12) (actual time=0.003..0.005 rows=4 loops=25)
               Index Cond: (c.calls_id = calls_id)
 Total runtime: 0.565 ms

仅仅聚合为数组需要这么长的时间?我做错了什么吗?

我正在使用Postgres 9.3。


我还注意到sort期望16429行,但只得到了100行。不知何故,查询计划器自己欺骗了自己... - user3207838
4
你在第二个查询中删除了group by语句。这很可能是影响时间的主要原因。 - Gordon Linoff
@GordonLinoff 是的,我知道,但这正是我提问的重点。 - user3207838
6
第一个查询处理并聚合了81922行数据,然后舍弃掉其中的81822行。第二个查询只处理了100行数据。当然,处理(和分组)81922行数据比仅检索100行数据需要更长的时间。 - user330315
@user3207838 选择100行与分组聚合联接表和限制输出是完全不同的操作。这就像你在比较两个不同的查询,并对执行时间感到失望一样。就像你可以使用select * from table_a'和select 1`,并惊讶于第二个更快一样。 - Gabriel's Messanger
显示剩余4条评论
2个回答

1
选取表calls中的100行,然后连接和聚合calls_statistics是一种选择。
类似这样:
WITH top_calls as (SELECT c.*
FROM calls AS c
ORDER BY caller_id ASC, call_time ASC 
LIMIT 100)
SELECT c.*,
array_agg(cs.mean) AS statistics_means
FROM top_calls AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id
order by caller_id ASC, call_time ASC;

它将为您提供与第一次查询完全相同的输出。

0

在没有所有信息和实时系统的情况下优化查询可能有点困难,但我会尝试一下。您可以将限制移动到子查询中,这样应该可以更快地工作。

SELECT c.*,
array_agg(cs.mean) AS statistics_means
FROM 
  (SELECT *
   FROM calls AS c
   ORDER BY caller_id ASC, call_time ASC 
   LIMIT 100) AS c
LEFT JOIN calls_statistics AS cs ON c.calls_id = cs.calls_id
GROUP BY c.calls_id;

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接