背景
我有一张表格,用于存储所有被路由器拦截的网络流量数据。目前,这个表格大约有590万行。
问题
我正在尝试运行一个简单的查询,以按天计算收到的数据包数量,这不应该花费太长时间。
第一次运行查询,需要88秒,然后再运行一次只需要33秒,之后的所有运行都只需要5秒。
主要问题不在于查询速度,而是在于执行相同查询3次之后,速度几乎快了20倍。
我了解查询缓存的概念,但原始查询的性能对我来说毫无意义。
测试
我使用的连接列(datetime)是timestamptz
类型,并且具有索引:
CREATE INDEX date ON netflows USING btree (datetime);
浏览EXPLAIN
语句,执行中的差异在于Nested Loop
。
我已经使用相同的结果对表进行了 VACUUM ANALYZE
。
当前环境
- 运行在 VMware ESX 4.1 上的 Linux Ubuntu 12.04 虚拟机
- PostgreSQL 9.1
- VM 具有2GB RAM、2个核心
- 数据库服务器完全专用于此,并且没有做任何其他事情
- 每分钟向表中插入一次数据(100行/分钟)
- 磁盘、内存或CPU活动非常低
查询
with date_list as (
select
series as start_date,
series + '23:59:59' as end_date
from
generate_series(
(select min(datetime) from netflows)::date,
(select max(datetime) from netflows)::date,
'1 day') as series
)
select
start_date,
end_date,
count(*)
from
netflows
inner join date_list on (datetime between start_date and end_date)
group by
start_date,
end_date;
首次运行解释(88秒)
Sort (cost=27007355.59..27007356.09 rows=200 width=8) (actual time=89647.054..89647.055 rows=18 loops=1)
Sort Key: date_list.start_date
Sort Method: quicksort Memory: 25kB
CTE date_list
-> Function Scan on generate_series series (cost=0.13..12.63 rows=1000 width=8) (actual time=92.567..92.667 rows=19 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=71.270..71.270 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=71.259..71.261 rows=1 loops=1)
-> Index Scan using date on netflows (cost=0.00..303662.15 rows=5945591 width=8) (actual time=71.252..71.252 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
InitPlan 4 (returns $3)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=11.786..11.787 rows=1 loops=1)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=11.778..11.779 rows=1 loops=1)
-> Index Scan Backward using date on netflows (cost=0.00..303662.15 rows=5945591 width=8) (actual time=11.776..11.776 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
-> HashAggregate (cost=27007333.31..27007335.31 rows=200 width=8) (actual time=89639.167..89639.179 rows=18 loops=1)
-> Nested Loop (cost=0.00..23704227.20 rows=660621222 width=8) (actual time=92.667..88059.576 rows=5945457 loops=1)
-> CTE Scan on date_list (cost=0.00..20.00 rows=1000 width=16) (actual time=92.578..92.785 rows=19 loops=1)
-> Index Scan using date on netflows (cost=0.00..13794.89 rows=660621 width=8) (actual time=2.438..4571.884 rows=312919 loops=19)
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date))
Total runtime: 89668.047 ms
第三次运行的解释(5秒)
Sort (cost=27011357.45..27011357.95 rows=200 width=8) (actual time=5645.031..5645.032 rows=18 loops=1)
Sort Key: date_list.start_date
Sort Method: quicksort Memory: 25kB
CTE date_list
-> Function Scan on generate_series series (cost=0.13..12.63 rows=1000 width=8) (actual time=0.108..0.204 rows=19 loops=1)
InitPlan 2 (returns $1)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=0.050..0.050 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=0.046..0.046 rows=1 loops=1)
-> Index Scan using date on netflows (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.046..0.046 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
InitPlan 4 (returns $3)
-> Result (cost=0.05..0.06 rows=1 width=0) (actual time=0.026..0.026 rows=1 loops=1)
InitPlan 3 (returns $2)
-> Limit (cost=0.00..0.05 rows=1 width=8) (actual time=0.026..0.026 rows=1 loops=1)
-> Index Scan Backward using date on netflows (cost=0.00..303705.14 rows=5946469 width=8) (actual time=0.026..0.026 rows=1 loops=1)
Index Cond: (datetime IS NOT NULL)
-> HashAggregate (cost=27011335.17..27011337.17 rows=200 width=8) (actual time=5645.005..5645.009 rows=18 loops=1)
-> Nested Loop (cost=0.00..23707741.28 rows=660718778 width=8) (actual time=0.134..4176.406 rows=5946329 loops=1)
-> CTE Scan on date_list (cost=0.00..20.00 rows=1000 width=16) (actual time=0.110..0.343 rows=19 loops=1)
-> Index Scan using date on netflows (cost=0.00..13796.94 rows=660719 width=8) (actual time=0.026..164.117 rows=312965 loops=19)
Index Cond: ((datetime >= date_list.start_date) AND (datetime <= date_list.end_date))
Total runtime: 5645.189 ms
sync ; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
。 - Frank Farmertimezonetz
的时候实际上是想表达timestamptz
,并已经进行了修正。 - Erwin Brandstetter