优化Postgres搜索查询的问题

Question

优化Postgres搜索查询的问题

5

我在以下的PostgreSQL查询中遇到了问题，它需要超过10秒才能运行，有没有什么方法可以加速这个查询以达到合理的速度，我只是在寻找与非常大的数据库上的视频相关的最相关的搜索词。

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

当使用 analyze 运行查询时，得到的查询计划如下 (http://explain.depesz.com/s/yDJ):

  Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
     Output: (count(*)), videoid
     ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
           Output: (count(*)), videoid
           Sort Key: (count(*))
           Sort Method: top-N heapsort  Memory: 181kB
           ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
                 Output: count(*), videoid
                 ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                       Output: id, videoid, word, termindex, weight
                       Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                       Rows Removed by Index Recheck: 25512434
                       ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                   Index Cond: ((term_search.word)::text = 'tester'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                   Index Cond: ((term_search.word)::text = 'question'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                   Index Cond: ((term_search.word)::text = 'one'::text)

表的架构如下：

    Column   |          Type          |                        Modifiers                         | Storage  | Description 
  -----------+------------------------+----------------------------------------------------------+----------+-------------
   id        | integer                | not null default nextval('term_search_id_seq'::regclass) | plain    | 
   videoid   | integer                |                                                          | plain    | 
   word      | character varying(100) |                                                          | extended | 
   termindex | character varying(15)  |                                                          | extended | 
   weight    | smallint               |                                                          | plain    | 
  Indexes:
      "term_search_pkey" PRIMARY KEY, btree (id)
      "search_term_exists_idx" btree (videoid, word)
      "terms_caverphone_idx" btree (termindex)
      "terms_video_idx" btree (videoid)
      "terms_word_idx" btree (word, videoid)
  Foreign-key constraints:
      "term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
  Has OIDs: no

我已经通过仅使用索引扫描将其降低到7秒，但时间仍然不够短。我在AWS r3.xlarge实例上运行Ubuntu 14.04上的PostgreSQL 9.3，并且表中有大约5000万行。非常感谢您提供任何建议！

编辑：

以下是SELECT schemaname，tablename，attname，null_frac，avg_width，n_distinct FROM pg_stats WHERE schemaname ='public' and tablename ='term_search'; 的结果。

 schemaname |  tablename  |  attname  | null_frac | avg_width | n_distinct 
 ------------+-------------+-----------+-----------+-----------+------------
 public     | term_search | id        |         0 |         4 |         -1
 public     | term_search | videoid   |         0 |         4 |     568632
 public     | term_search | word      |         0 |         6 |       5054
 public     | term_search | termindex |         0 |        11 |       2485
 public     | term_search | weight    |         0 |         2 |          3

- user3354369

1

使用 word IN ('tester','question','one') 与使用 word='tester' OR word='question' OR word='one' 有什么区别吗？ - StanislavL

我尝试使用“in”和“ANY”代替“or”，但并没有太大的区别。总运行时间：10931.711毫秒 - user3354369

1

你能改变表 term_search 的结构吗？列 word 有多少个不同的值？ - klin

@klin 是的，我可以将结构更改为任何内容。我正在收集单词不同值的查询，明天早上我会发布。 - user3354369

@vyegorov 这只有2倍的误差，这是完全合理的。请记住，这是一种统计抽样，是一种估计。此外，更高的统计目标会导致统计收集和规划成本更高，它们并不是免费的（否则它们默认会非常高）。通常情况下，当统计数据相差几个数量级时，您应该感到担忧，除了1与非1的特殊情况。 - Craig Ringer

显示剩余4条评论

4个回答

1

让我们从重新表述查询开始，以解释它实际上想要做什么。

查询如下：

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

看起来是这样的：

“在一个搜索词表中，找到包含搜索词tester，question或one的视频。计算每个视频的匹配次数，并返回匹配次数最多的1800个视频。”

或者更普遍地说：

“找到最符合我的搜索词的视频，并展示前n个最佳匹配结果。”

对吗？

如果是这样，为什么不使用PostgreSQL内置的全文搜索和全文索引呢？针对每个视频的索引tsquery匹配tsvector很可能会胜出。全文搜索具有模糊匹配、排名和几乎所有你需要的功能——与你当前的方法不同，它不需要整个数据集被实现并排序，然后才能丢弃大部分数据。

由于你没有提供示例数据，所以我无法进行演示。

PostgreSQL当前执行您的查询的方式可以解释为：

创建一个地图，每个表中的磁盘页（8kb）都有一位，其中true表示该页面可能包含一个或多个匹配行。对于每个搜索词，在索引terms_word_idx上进行扫描，并更新位图以设置找到匹配项的位。扫描表格，跳过位图指示没有匹配项的页面，查找具有任何单词的行。这类似于快速跳过空白的seqscan。如果匹配的百分比很高，则实际上并不比普通的seqscan快得多。对于每个匹配行，根据视频ID将其排序为一系列“桶”。然后在最后，计算每个桶中有多少行，并返回该桶的计数+视频ID。（它并不简单，但足够接近）。当您计算每个桶时，请将结果放在下一个最高和最低计数之间的结果中。取前1800个结果，并丢弃所有其他结果。

那听起来不是很有趣，但它没有选择。 B树索引不能下降以同时搜索多个术语，因此必须进行多个索引扫描。其余部分也随之而来。

因此：要使此更有效率，您需要从根本上改变解决问题的方式。添加索引或调整某些参数不会突然使其花费0.5秒。

- Craig Ringer

0

您可以优化PostgreSQL设置以减少查询执行时间。例如，您可以使用pgtune工具：

apt-get install pgtune
cd /etc/postgresql/*.*/main/
cp postgresql.conf postgresql.conf.default
pgtune -i postgresql.conf.default -o postgresql.conf --type=%TYPE%

这里的%TYPE%是以下值之一：

DATA 用于大数据量、大查询和低频调用
WEB 用于Web应用程序，最适合Django应用程序和其他Web应用程序

有关pgtune的其他信息可以在Google和帮助中找到。

对于PostgreSQL < 9.3，您必须使用此脚本：

#!/bin/bash
# simple shmsetup script
page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`
shmall=`expr $phys_pages / 2`
shmmax=`expr $shmall \* $page_size`
echo kernel.shmmax = $shmmax
echo kernel.shmall = $shmall

将结果写入文件 /etc/sysctl.conf 并重新启动系统。否则，Postgres 无法启动。

- Dunaevsky Maxim

2

它最多只是一个相当粗糙的工具，远非神奇的“让我的数据库更快”的工具。当面对特定查询时，最好查看查询计划、统计信息等。 - Craig Ringer

0

其他人已经提供了一些建议来重构数据库，但是你可能可以让当前的查询运行得更好。下面的EXPLAIN中的这行提示表明你的位图溢出了：

Rows Removed by Index Recheck: 25512434

如果重新检查是消耗时间的原因（而不是IO消耗时间 - 如果您运行EXPLAIN（ANALYZE，BUFFERS），它将有助于澄清这一点，特别是如果track_io_timing打开），那么增加work_mem可能会有很大帮助，假设您可以负担得起这样做而不会耗尽RAM。

- jjanes

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- klin · Accepted Answer

如果我有机会让用户断开连接一晚，我会：

创建一个包含来自term_search的words的新表格，
创建参考新表格的引用，
删除word列，

类似于这样：

create table words (
    word_id serial primary key,
    word text);

insert into words (word)
    select distinct word
    from term_search;

alter table term_search add column word_id integer;

update term_search t
    set word_id = w.word_id
    from words w
    where t.word = w.word;

alter table term_search add constraint term_search_word_fkey 
    foreign key (word_id) references words (word_id);

测试：

SELECT count(*), videoid 
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester' 
   OR w.word = 'question' 
   OR w.word = 'one' 
GROUP BY videoid 
ORDER BY count(*) desc 
LIMIT 1800;    

-- if was faster then
    alter table term_search drop column word;
-- and on the fly...
    alter table term_search alter termindex type text;

革命后，我需要负责对term_search进行插入和更新的处理。我可能会创建一个视图，其中包含有关插入和更新的规则。