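I have a very simple SQL query:
SELECT COUNT(DISTINCT x) FROM table;
My table has about 1.5 million rows. This query runs pretty slowly; it takes about 7.5 s, compared to
SELECT COUNT(x) FROM table;
which takes about 435 ms. Is there any way to change my query to improve performance? I have tried grouping and doing a regular count, as well as putting an index on x; both have the same 7.5 s execution time.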
You can use this:
SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;
This is much faster than:
COUNT(DISTINCT column_name)
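
The two forms are not strictly interchangeable when the column contains NULLs. Here is a minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in engine so that it runs without a PostgreSQL server. The table t and column x are hypothetical, and the sketch only compares results, not speed.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in engine (assumption:
# the thread itself is about PostgreSQL). Table "t" and column "x" are
# hypothetical names for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany(
    "INSERT INTO t (x) VALUES (?)",
    [(1,), (1,), (2,), (3,), (None,), (None,)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# COUNT(DISTINCT x) ignores NULLs.
count_distinct = scalar("SELECT COUNT(DISTINCT x) FROM t")

# SELECT DISTINCT keeps one NULL row, and COUNT(*) counts it.
subquery_star = scalar(
    "SELECT COUNT(*) FROM (SELECT DISTINCT x FROM t) AS temp"
)

# Counting the column in the outer query skips the NULL row again.
subquery_col = scalar(
    "SELECT COUNT(x) FROM (SELECT DISTINCT x FROM t) AS temp"
)

print(count_distinct, subquery_star, subquery_col)
```

If the column can hold NULLs, writing COUNT(x) instead of COUNT(*) in the outer query keeps the result identical to COUNT(DISTINCT x).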
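
COUNT(DISTINCT()) performs a sort, so an index on column_name will definitely help, especially with a relatively small work_mem (where hashing would produce relatively large batches). So using COUNT(DISTINCT()) is not always a bad choice, is it? - St.Antario

Count(column) only counts non-null values. count(*) counts rows. So the first/longer form will also count the null row (once). Change it to count(column_name) to make them behave the same. - GolezTrol

-- My default settings (this is basically a single-session machine, so work_mem is pretty high)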
SET effective_cache_size='2048MB';
SET work_mem='16MB';
\echo original
EXPLAIN ANALYZE
SELECT
COUNT (distinct val) as aantal
FROM one
;
\echo group by+count(*)
EXPLAIN ANALYZE
SELECT
distinct val
-- , COUNT(*)
FROM one
GROUP BY val;
\echo with CTE
EXPLAIN ANALYZE
WITH agg AS (
SELECT distinct val
FROM one
GROUP BY val
)
SELECT COUNT (*) as aantal
FROM agg
;
Results:
original
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=36448.06..36448.07 rows=1 width=4) (actual time=1766.472..1766.472 rows=1 loops=1)
  -> Seq Scan on one (cost=0.00..32698.45 rows=1499845 width=4) (actual time=31.371..185.914 rows=1499845 loops=1)
Total runtime: 1766.642 ms
(3 rows)
group by+count(*)
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=36464.31..36477.31 rows=1300 width=4) (actual time=412.470..412.598 rows=1300 loops=1)
  -> HashAggregate (cost=36448.06..36461.06 rows=1300 width=4) (actual time=412.066..412.203 rows=1300 loops=1)
    -> Seq Scan on one (cost=0.00..32698.45 rows=1499845 width=4) (actual time=26.134..166.846 rows=1499845 loops=1)
Total runtime: 412.686 ms
(4 rows)
with CTE
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=36506.56..36506.57 rows=1 width=0) (actual time=408.239..408.239 rows=1 loops=1)
  CTE agg
    -> HashAggregate (cost=36464.31..36477.31 rows=1300 width=4) (actual time=407.704..407.847 rows=1300 loops=1)
      -> HashAggregate (cost=36448.06..36461.06 rows=1300 width=4) (actual time=407.320..407.467 rows=1300 loops=1)
        -> Seq Scan on one (cost=0.00..32698.45 rows=1499845 width=4) (actual time=24.321..165.256 rows=1499845 loops=1)
  -> CTE Scan on agg (cost=0.00..26.00 rows=1300 width=0) (actual time=407.707..408.154 rows=1300 loops=1)
Total runtime: 408.300 ms
(7 rows)
Using window functions could probably also produce the same plan as the CTE.
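
[...] distinct x) Is there an index available on the column?

If your count(distinct(x)) is significantly slower than count(x), then you can speed up this query by maintaining x value counts in a separate table, for example table_name_x_counts (x integer not null, x_count int not null), using triggers. But your write performance will suffer, and if you update multiple x values in a single transaction then you need to do so in some explicit order to avoid possible deadlocks.

SELECT DISTINCT COUNT(*) OVER() as total_count, * FROM table_name limit 2 offset 0;
This query also performs well.

I had a similar problem, but I wanted to count multiple columns. So I tried the following two queries.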
Count Distinct:
SELECT
to_char(action_date, 'YYYY-MM') as "Month",
count(*) as "Count",
count(distinct batch_id)
FROM transactions t
JOIN batches b on t.batch_id = b.id
GROUP BY to_char(action_date, 'YYYY-MM')
ORDER BY to_char(action_date, 'YYYY-MM');
Sub-Query:
WITH batch_counts AS (
SELECT to_char(action_date, 'YYYY-MM') as "Month",
COUNT(*) as t_count
FROM transactions t
JOIN batches b on t.batch_id = b.id
GROUP BY b.id
)
SELECT "Month",
SUM(t_count) as "Transactions",
COUNT(*) as "Batches"
FROM batch_counts
GROUP BY "Month"
ORDER BY "Month";
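I ran both of these queries multiple times on my test data of about 100,000 rows. The sub-query method ran in about 90 ms on average, while the count distinct method took about 200 ms on average.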
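
To check that the two formulations return the same numbers, here is a small sketch on a toy dataset. It uses Python's built-in sqlite3 module as a stand-in engine, so strftime replaces PostgreSQL's to_char. The schema is an assumption: action_date is taken to be a column of batches, and the sample rows are made up. It only verifies that the results match, not the timings.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine. The schema below is a
# guess at the tables in the answer: action_date is assumed to live on
# batches, and all rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE batches (id INTEGER PRIMARY KEY, action_date TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        batch_id INTEGER REFERENCES batches(id)
    );
    INSERT INTO batches VALUES
        (1, '2024-01-05'), (2, '2024-01-20'), (3, '2024-02-02');
    INSERT INTO transactions (batch_id) VALUES (1), (1), (1), (2), (3), (3);
    """
)

# Count-distinct formulation: one pass, COUNT(DISTINCT ...) per month.
count_distinct_sql = """
    SELECT strftime('%Y-%m', b.action_date) AS ym,
           COUNT(*) AS transactions,
           COUNT(DISTINCT t.batch_id) AS batches
    FROM transactions t
    JOIN batches b ON t.batch_id = b.id
    GROUP BY strftime('%Y-%m', b.action_date)
    ORDER BY ym
"""

# Sub-query formulation: aggregate per batch first, then count the
# per-batch rows within each month.
sub_query_sql = """
    WITH batch_counts AS (
        SELECT strftime('%Y-%m', b.action_date) AS ym,
               COUNT(*) AS t_count
        FROM transactions t
        JOIN batches b ON t.batch_id = b.id
        GROUP BY b.id
    )
    SELECT ym, SUM(t_count) AS transactions, COUNT(*) AS batches
    FROM batch_counts
    GROUP BY ym
    ORDER BY ym
"""

rows_distinct = conn.execute(count_distinct_sql).fetchall()
rows_subquery = conn.execute(sub_query_sql).fetchall()

print(rows_distinct)
print(rows_subquery)
```

The equivalence relies on each batch having exactly one action_date, so that grouping by b.id first never splits a batch across months.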
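
[...] (the output of psql's \d is a good option) and point out the column you are having problems with. It would be best to see the EXPLAIN ANALYZE output of both queries. - vyegorov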