I'm trying to optimize a query in Postgres, without success.
Here's my table:
CREATE TABLE IF NOT EXISTS voc_cc348779bdc84f8aab483f662a798a6a (
id SERIAL,
date TIMESTAMP,
text TEXT,
themes JSONB,
meta JSONB,
canal VARCHAR(255),
source VARCHAR(255),
file VARCHAR(255)
);
I have created indexes on the id and meta columns:
CREATE UNIQUE INDEX voc_cc348779bdc84f8aab483f662a798a6a_id ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(id);
CREATE INDEX voc_cc348779bdc84f8aab483f662a798a6a_meta ON voc_cc348779bdc84f8aab483f662a798a6a USING btree(meta);
This table has about 62,000 rows.
The query I'm trying to optimize is this one:
SELECT meta_split.key, meta_split.value, COUNT(DISTINCT(id))
FROM voc_cc348779bdc84f8aab483f662a798a6a
LEFT JOIN LATERAL jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
AS meta_split ON TRUE
WHERE meta_split.value IS NOT NULL
GROUP BY meta_split.key, meta_split.value;
In this query, meta is a dictionary like this one:
{
"Age":"50 to 59 yo",
"Kids":"No kid",
"Gender":"Male"
}
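To make the goal concrete, here is a minimal sketch that runs the same query against two made-up rows. The sample CTE, its ids and its meta values are hypothetical and only there for illustration; they are not taken from my real table.

```sql
-- Hypothetical sample data standing in for the real table.
WITH sample(id, meta) AS (
    VALUES
        (1, '{"Age":"50 to 59 yo","Kids":"No kid","Gender":"Male"}'::jsonb),
        (2, '{"Age":"50 to 59 yo","Gender":"Female"}'::jsonb)
)
-- Same shape as the real query: expand meta into key/value rows,
-- then count the distinct ids per key/value pair.
SELECT meta_split.key, meta_split.value, COUNT(DISTINCT id)
FROM sample
LEFT JOIN LATERAL jsonb_each(sample.meta) AS meta_split ON TRUE
WHERE meta_split.value IS NOT NULL
GROUP BY meta_split.key, meta_split.value
ORDER BY meta_split.key, meta_split.value;  -- ordering added only to make the output stable
```

With those two rows I'd expect four result rows: Age / "50 to 59 yo" with a count of 2, and Gender / "Female", Gender / "Male" and Kids / "No kid" with a count of 1 each.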
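I want to get the full list of key/value pairs found in meta, together with the number of rows for each pair. Here is the result of EXPLAIN ANALYZE VERBOSE on my query: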
GroupAggregate (cost=1138526.13..1201099.13 rows=100 width=72) (actual time=2016.984..2753.058 rows=568 loops=1)
  Output: meta_split.key, meta_split.value, count(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.id)
  Group Key: meta_split.key, meta_split.value
  -> Sort (cost=1138526.13..1154169.13 rows=6257200 width=68) (actual time=2015.501..2471.027 rows=563148 loops=1)
        Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
        Sort Key: meta_split.key, meta_split.value
        Sort Method: external merge Disk: 26672kB
        -> Nested Loop (cost=0.00..131538.72 rows=6257200 width=68) (actual time=0.029..435.456 rows=563148 loops=1)
              Output: meta_split.key, meta_split.value, voc_cc348779bdc84f8aab483f662a798a6a.id
              -> Seq Scan on public.voc_cc348779bdc84f8aab483f662a798a6a (cost=0.00..6394.72 rows=62572 width=294) (actual time=0.007..16.588 rows=62572 loops=1)
                    Output: voc_cc348779bdc84f8aab483f662a798a6a.id, voc_cc348779bdc84f8aab483f662a798a6a.date, voc_cc348779bdc84f8aab483f662a798a6a.text, voc_cc348779bdc84f8aab483f662a798a6a.themes, voc_cc348779bdc84f8aab483f662a798a6a.meta, voc_cc348779bdc84f8aab483f662a798a6a.canal, voc_cc348779bdc84f8aab483f662a798a6a.source, voc_cc348779bdc84f8aab483f662a798a6a.file
              -> Function Scan on pg_catalog.jsonb_each meta_split (cost=0.00..1.00 rows=100 width=64) (actual time=0.005..0.005 rows=9 loops=62572)
                    Output: meta_split.key, meta_split.value
                    Function Call: jsonb_each(voc_cc348779bdc84f8aab483f662a798a6a.meta)
                    Filter: (meta_split.value IS NOT NULL)
Planning Time: 1.502 ms
Execution Time: 2763.309 ms
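I tried changing COUNT(DISTINCT(id)) to COUNT(DISTINCT voc_cc348779bdc84f8aab483f662a798a6a.*) and using a subquery, which made the query 10 times and 30 times slower respectively. I also considered maintaining these counts in a separate table, but that won't work because I need to filter the results (for instance, the query sometimes filters on the date column or similar).
I don't know how to optimize this any further, and it is already very slow for such a small number of rows. I expect to have ten times as many rows later, and if the runtime grows in proportion to the row count, as it did for the first 62k rows, it will be far too slow.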