Query is very slow on its first run, even with indexes.

I have a very slow query that remains slow despite using indexes (it takes about 1-3 minutes). Users will run similar queries 4-6 times in a row, so speed is critical.

The query:

            SELECT SUM(bh.count) AS count,b.time AS batchtime
            FROM
              batchtimes AS b
            INNER JOIN batchtimes_headlines AS bh ON b.hashed_id = bh.batchtime_hashed_id
            INNER JOIN headlines_ngrams AS hn ON bh.headline_hashed_id = hn.headline_hashed_id
            INNER JOIN ngrams AS n ON hn.ngram_hashed_id = n.hashed_id
            INNER JOIN homepages_headlines AS hh ON bh.headline_hashed_id = hh.headline_hashed_id
            INNER JOIN homepages AS hp ON hh.homepage_hashed_id = hp.hashed_id
            WHERE
              b.time IN (SELECT * FROM generate_series('2013-10-10 20:00:00.000000'::timestamp,'2014-02-16 20:00:00.000000'::timestamp,'1 hours'))
              AND ( n.gram = 'a' )
              AND hp.url = 'www.abcdefg.com'
            GROUP BY
              b.time
            ORDER BY
              b.time ASC;

EXPLAIN ANALYZE after the first run:

GroupAggregate  (cost=6863.26..6863.79 rows=30 width=12) (actual time=90905.858..90908.889 rows=3039 loops=1)
 ->  Sort  (cost=6863.26..6863.34 rows=30 width=12) (actual time=90905.853..90906.971 rows=19780 loops=1)
     Sort Key: b."time"
     Sort Method: quicksort  Memory: 1696kB
     ->  Hash Join  (cost=90.16..6862.52 rows=30 width=12) (actual time=378.784..90890.636 rows=19780 loops=1)
           Hash Cond: (b."time" = generate_series.generate_series)
           ->  Nested Loop  (cost=73.16..6845.27 rows=60 width=12) (actual time=375.644..90859.059 rows=22910 loops=1)
                 ->  Nested Loop  (cost=72.88..6740.51 rows=60 width=37) (actual time=375.624..90618.828 rows=22910 loops=1)
                       ->  Nested Loop  (cost=42.37..4391.06 rows=1 width=66) (actual time=368.993..54607.402 rows=1213 loops=1)
                             ->  Nested Loop  (cost=42.23..4390.18 rows=5 width=99) (actual time=223.681..53051.774 rows=294787 loops=1)
                                   ->  Nested Loop  (cost=41.68..4379.19 rows=5 width=33) (actual time=223.643..49403.746 rows=294787 loops=1)
                                         ->  Index Scan using by_gram_ngrams on ngrams n  (cost=0.56..8.58 rows=1 width=33) (actual time=17.001..17.002 rows=1 loops=1)
                                               Index Cond: ((gram)::text = 'a'::text)
                                         ->  Bitmap Heap Scan on headlines_ngrams hn  (cost=41.12..4359.59 rows=1103 width=66) (actual time=206.634..49273.363 rows=294787 loops=1)
                                               Recheck Cond: ((ngram_hashed_id)::text = (n.hashed_id)::text)
                                               ->  Bitmap Index Scan on by_ngramhashedid_headlinesngrams  (cost=0.00..40.84 rows=1103 width=0) (actual time=143.430..143.430 rows=294787 loops=1)
                                                     Index Cond: ((ngram_hashed_id)::text = (n.hashed_id)::text)
                                   ->  Index Scan using by_headlinehashedid_homepagesheadlines on homepages_headlines hh  (cost=0.56..2.19 rows=1 width=66) (actual time=0.011..0.011 rows=1 loops=294787)
                                         Index Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                             ->  Index Scan using by_hashedid_homepages on homepages hp  (cost=0.14..0.17 rows=1 width=33) (actual time=0.005..0.005 rows=0 loops=294787)
                                   Index Cond: ((hashed_id)::text = (hh.homepage_hashed_id)::text)
                                   Filter: ((url)::text = 'www.abcdefg.com'::text)
                                   Rows Removed by Filter: 1
                       ->  Bitmap Heap Scan on batchtimes_headlines bh  (cost=30.51..2333.86 rows=1560 width=70) (actual time=7.977..29.674 rows=19 loops=1213)
                             Recheck Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                             ->  Bitmap Index Scan on by_headlinehashedid_batchtimesheadlines  (cost=0.00..30.12 rows=1560 width=0) (actual time=6.595..6.595 rows=19 loops=1213)
                                   Index Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                 ->  Index Scan using by_hashedid_batchtimes on batchtimes b  (cost=0.28..1.74 rows=1 width=41) (actual time=0.009..0.009 rows=1 loops=22910)
                       Index Cond: ((hashed_id)::text = (bh.batchtime_hashed_id)::text)
           ->  Hash  (cost=14.50..14.50 rows=200 width=8) (actual time=3.130..3.130 rows=3097 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 121kB
                 ->  HashAggregate  (cost=12.50..14.50 rows=200 width=8) (actual time=1.819..2.342 rows=3097 loops=1)
                       ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=8) (actual time=0.441..0.714 rows=3097 loops=1)

Total runtime: 90911.001 ms

EXPLAIN ANALYZE after the second run:

GroupAggregate  (cost=6863.26..6863.79 rows=30 width=12) (actual time=3122.861..3125.796 rows=3039 loops=1)
 ->  Sort  (cost=6863.26..6863.34 rows=30 width=12) (actual time=3122.857..3123.882 rows=19780 loops=1)
     Sort Key: b."time"
     Sort Method: quicksort  Memory: 1696kB
     ->  Hash Join  (cost=90.16..6862.52 rows=30 width=12) (actual time=145.396..3116.467 rows=19780 loops=1)
           Hash Cond: (b."time" = generate_series.generate_series)
           ->  Nested Loop  (cost=73.16..6845.27 rows=60 width=12) (actual time=142.406..3102.864 rows=22910 loops=1)
                 ->  Nested Loop  (cost=72.88..6740.51 rows=60 width=37) (actual time=142.395..3011.768 rows=22910 loops=1)
                       ->  Nested Loop  (cost=42.37..4391.06 rows=1 width=66) (actual time=142.229..2969.144 rows=1213 loops=1)
                             ->  Nested Loop  (cost=42.23..4390.18 rows=5 width=99) (actual time=135.799..2142.666 rows=294787 loops=1)
                                   ->  Nested Loop  (cost=41.68..4379.19 rows=5 width=33) (actual time=135.768..437.824 rows=294787 loops=1)
                                         ->  Index Scan using by_gram_ngrams on ngrams n  (cost=0.56..8.58 rows=1 width=33) (actual time=0.030..0.031 rows=1 loops=1)
                                               Index Cond: ((gram)::text = 'a'::text)
                                         ->  Bitmap Heap Scan on headlines_ngrams hn  (cost=41.12..4359.59 rows=1103 width=66) (actual time=135.732..405.943 rows=294787 loops=1)
                                               Recheck Cond: ((ngram_hashed_id)::text = (n.hashed_id)::text)
                                               ->  Bitmap Index Scan on by_ngramhashedid_headlinesngrams  (cost=0.00..40.84 rows=1103 width=0) (actual time=72.570..72.570 rows=294787 loops=1)
                                                     Index Cond: ((ngram_hashed_id)::text = (n.hashed_id)::text)
                                   ->  Index Scan using by_headlinehashedid_homepagesheadlines on homepages_headlines hh  (cost=0.56..2.19 rows=1 width=66) (actual time=0.005..0.005 rows=1 loops=294787)
                                         Index Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                             ->  Index Scan using by_hashedid_homepages on homepages hp  (cost=0.14..0.17 rows=1 width=33) (actual time=0.003..0.003 rows=0 loops=294787)
                                   Index Cond: ((hashed_id)::text = (hh.homepage_hashed_id)::text)
                                   Filter: ((url)::text = 'www.abcdefg.com'::text)
                                   Rows Removed by Filter: 1
                       ->  Bitmap Heap Scan on batchtimes_headlines bh  (cost=30.51..2333.86 rows=1560 width=70) (actual time=0.015..0.031 rows=19 loops=1213)
                             Recheck Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                             ->  Bitmap Index Scan on by_headlinehashedid_batchtimesheadlines  (cost=0.00..30.12 rows=1560 width=0) (actual time=0.013..0.013 rows=19 loops=1213)
                                   Index Cond: ((headline_hashed_id)::text = (hn.headline_hashed_id)::text)
                 ->  Index Scan using by_hashedid_batchtimes on batchtimes b  (cost=0.28..1.74 rows=1 width=41) (actual time=0.003..0.004 rows=1 loops=22910)
                       Index Cond: ((hashed_id)::text = (bh.batchtime_hashed_id)::text)
           ->  Hash  (cost=14.50..14.50 rows=200 width=8) (actual time=2.982..2.982 rows=3097 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 121kB
                 ->  HashAggregate  (cost=12.50..14.50 rows=200 width=8) (actual time=1.771..2.311 rows=3097 loops=1)
                       ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=8) (actual time=0.439..0.701 rows=3097 loops=1)

Total runtime: 3125.985 ms

I have a server with 32 GB of RAM. Here are the changes made to postgresql.conf:

  • default_statistics_target = 100
  • maintenance_work_mem = 1920MB
  • checkpoint_completion_target = 0.9
  • effective_cache_size = 16GB
  • work_mem = 160MB
  • wal_buffers = 16MB
  • checkpoint_segments = 32
  • shared_buffers = 7680MB

The database was recently vacuumed, reindexed, and analyzed.

Do you have any suggestions on how to tune this query?


The only thing that looks off to me is the line "b.time IN (SELECT * FROM generate_series('2013-10-10 20:00:00.000000'::timestamp,'2014-02-16 20:00:00.000000'::timestamp,'1 hours'))"... what are you trying to accomplish with that line? - Twelfth
Your statistics are off. Start by running VACUUM ANALYZE on all the tables involved (do you have keys/indexes on them?). - wildplasser
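
For reference, the maintenance this comment is suggesting amounts to the following, using the table names from the query above (a minimal sketch, not a tuning guarantee):

    -- Reclaim dead tuples and refresh planner statistics for every
    -- table involved in the join; table names come from the question.
    VACUUM ANALYZE ngrams;
    VACUUM ANALYZE headlines_ngrams;
    VACUUM ANALYZE homepages_headlines;
    VACUUM ANALYZE homepages;
    VACUUM ANALYZE batchtimes_headlines;
    VACUUM ANALYZE batchtimes;
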
By the way, what is the actual granularity/resolution of b.time? And why not use b.time >= '2013-10-01' AND b.time < '2014-02-17' (perhaps adding some date_trunc() before aggregating)? Also, your work_mem is very high (and this query doesn't need it); how many active sessions do you have? What do vmstat/top show? - wildplasser
Add me to the list of people who find the generate_series suspect. You can certainly get what you need with BETWEEN, which would also let you do something better with an index on b.time. BTW: using a reserved word like "time" as a column name is legal, but it will bring you grief. - Andrew Lazarus
@JustAFriend generate_series is very useful, but because its output is not indexed, it limits the ability to do the join in one pass (unless the planner is even smarter than I think). Used the way you are using it, I think it all but guarantees at least one nested loop, and those are usually bad. Can you explain the advantage over using BETWEEN? - Andrew Lazarus
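
To make the commenters' suggestion concrete, here is one way the query might be rewritten with a plain range predicate plus date_trunc(), so an index on b.time can be used directly. This is only a sketch of the rewrite being suggested, not a tested replacement: the IN (generate_series ...) form only matches exact hourly ticks, so the range form is equivalent only if b.time is already hour-aligned (otherwise the date_trunc() grouping absorbs the difference):

    SELECT SUM(bh.count) AS count,
           date_trunc('hour', b.time) AS batchtime
    FROM batchtimes AS b
    INNER JOIN batchtimes_headlines AS bh ON b.hashed_id = bh.batchtime_hashed_id
    INNER JOIN headlines_ngrams AS hn ON bh.headline_hashed_id = hn.headline_hashed_id
    INNER JOIN ngrams AS n ON hn.ngram_hashed_id = n.hashed_id
    INNER JOIN homepages_headlines AS hh ON bh.headline_hashed_id = hh.headline_hashed_id
    INNER JOIN homepages AS hp ON hh.homepage_hashed_id = hp.hashed_id
    WHERE b.time >= '2013-10-10 20:00:00'::timestamp
      AND b.time <= '2014-02-16 20:00:00'::timestamp
      AND n.gram = 'a'
      AND hp.url = 'www.abcdefg.com'
    GROUP BY batchtime
    ORDER BY batchtime ASC;
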
1 Answer

This may or may not answer your question. Since I don't have 50 reputation on Stack Overflow, I can't comment above. :/
My first question is: why use an INNER JOIN? It returns columns you don't need. For example, in your query, when you do the inner join

INNER JOIN headlines_ngrams AS hn ON bh.headline_hashed_id = hn.headline_hashed_id

the result has two columns carrying the same information, which is redundant. So if you have 100,000,000 rows, both bh.headline_hashed_id and hn.headline_hashed_id will each hold 100,000,000 entries. And in the query above you are joining 5 tables. Moreover, you are only interested in

SELECT SUM(bh.count) AS count, b.time AS batchtime

so I think you should use a NATURAL JOIN.
[Link](http://en.wikipedia.org/wiki/Inner_join#Inner_join)
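
For what it's worth, NATURAL JOIN only collapses the duplicate column when both sides name it identically, which holds for the bh/hn join quoted above but not for most of the other joins in the question (hashed_id vs. batchtime_hashed_id, and so on). A minimal illustration of the difference, using two hypothetical tables:

    -- Hypothetical tables, for illustration only.
    CREATE TABLE headlines (headline_hashed_id text PRIMARY KEY, title text);
    CREATE TABLE headline_counts (headline_hashed_id text, count int);

    -- INNER JOIN ... ON returns both copies of the join column:
    SELECT *
    FROM headlines h
    INNER JOIN headline_counts c ON h.headline_hashed_id = c.headline_hashed_id;

    -- NATURAL JOIN joins on every identically named column
    -- and returns that column only once:
    SELECT *
    FROM headlines
    NATURAL JOIN headline_counts;
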
I think the reason the second run performs better is caching. Others have already mentioned that using a temporary table for generate_series could be a good option. Also, if you are considering using WITH in your query, you should read this article. link
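
A sketch of the temporary-table idea mentioned in the answer and in the comments: materialize the hourly series once, index it, and analyze it, so the planner gets real row estimates instead of the generic estimate for a function scan. The names hours and t are made up for the example:

    -- Materialize the hourly series once, with an index and statistics.
    CREATE TEMP TABLE hours AS
    SELECT generate_series('2013-10-10 20:00:00'::timestamp,
                           '2014-02-16 20:00:00'::timestamp,
                           '1 hour') AS t;
    CREATE INDEX ON hours (t);
    ANALYZE hours;

    -- The original query can then join against it:
    -- ... INNER JOIN hours ON b.time = hours.t ...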

Good point about the redundant columns from the inner joins! And thanks for the WITH article. I'll run a few experiments to apply these ideas. - JustAFriend
