PostgreSQL中的IN语句性能（以及一般情况下）

Question

PostgreSQL中的IN语句性能（以及一般情况下）

3

我知道这个问题可能已经被问过了，但是我在SO的搜索中找不到。

假设我有TABLE1和TABLE2，那么像这样的查询语句的性能应该如何：

SELECT * FROM TABLE1 WHERE id IN SUBQUERY_ON_TABLE2;

随着TABLE1和TABLE2中的行数增加，且id是TABLE1的主键，查询性能将会下降。使用IN语句查询数据是一个常见的错误，但由于TABLE2具有通用关系（Django通用关系）到多个其他表，因此我无法想到其他过滤数据的方法。当TABLE1和TABLE2中的行数达到多少时，我应该期望因此出现性能问题？性能是否会根据行数呈线性、指数等方式退化？请注意，保留HTML标记。

- Vasil

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Greg Smith · Accepted Answer

当子查询返回的记录数量较少，主查询返回的行数也很少时，每个查询都将快速进行索引查找。随着所返回数据的百分比增加，最终两者都会改为使用顺序扫描而不是索引扫描，在一次性获取整个表而不是逐个获取的情况下。这不是一个简单的性能下降，它不是线性的或指数级的；在计划类型改变时存在重大的不连续性。并且发生这些情况的行数取决于表的大小，因此您也没有有用的经验法则。您应该像下面我正在做的那样构建一个模拟，并查看自己的数据集上会发生什么，以了解曲线的形状。

这里有一个使用已装载Dell Store 2数据库的PostgreSQL 9.0数据库的示例。一旦子查询返回1000行，它就会对主表进行全表扫描。而一旦子查询考虑到10000条记录，那么它也会变成全表扫描。这些都运行了两次，所以你看到的是缓存性能。基于缓存与未缓存状态的性能变化是完全不同的话题。

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<2);
Nested Loop  (cost=8.27..16.56 rows=1 width=268) (actual time=0.051..0.060 rows=1 loops=1)
  ->  HashAggregate  (cost=8.27..8.28 rows=1 width=4) (actual time=0.028..0.030 rows=1 loops=1)
        ->  Index Scan using orders_pkey on orders  (cost=0.00..8.27 rows=1 width=4) (actual time=0.011..0.015 rows=1 loops=1)
              Index Cond: (orderid < 2)
  ->  Index Scan using customers_pkey on customers  (cost=0.00..8.27 rows=1 width=268) (actual time=0.013..0.016 rows=1 loops=1)
        Index Cond: (customers.customerid = orders.customerid)
Total runtime: 0.191 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<100);
Nested Loop  (cost=10.25..443.14 rows=100 width=268) (actual time=0.488..2.591 rows=98 loops=1)
  ->  HashAggregate  (cost=10.25..11.00 rows=75 width=4) (actual time=0.464..0.661 rows=98 loops=1)
        ->  Index Scan using orders_pkey on orders  (cost=0.00..10.00 rows=100 width=4) (actual time=0.019..0.218 rows=99 loops=1)
              Index Cond: (orderid < 100)
  ->  Index Scan using customers_pkey on customers  (cost=0.00..5.75 rows=1 width=268) (actual time=0.009..0.011 rows=1 loops=98)
        Index Cond: (customers.customerid = orders.customerid)
Total runtime: 2.868 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<1000);
Hash Semi Join  (cost=54.25..800.13 rows=1000 width=268) (actual time=4.574..80.319 rows=978 loops=1)
  Hash Cond: (customers.customerid = orders.customerid)
  ->  Seq Scan on customers  (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..33.665 rows=20000 loops=1)
  ->  Hash  (cost=41.75..41.75 rows=1000 width=4) (actual time=4.502..4.502 rows=999 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 24kB
        ->  Index Scan using orders_pkey on orders  (cost=0.00..41.75 rows=1000 width=4) (actual time=0.056..2.487 rows=999 loops=1)
              Index Cond: (orderid < 1000)
Total runtime: 82.024 ms

dellstore2=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customerid IN 
  (SELECT customerid FROM orders WHERE orderid<10000);
Hash Join  (cost=443.68..1444.68 rows=8996 width=268) (actual time=79.576..157.159 rows=7895 loops=1)
  Hash Cond: (customers.customerid = orders.customerid)
  ->  Seq Scan on customers  (cost=0.00..676.00 rows=20000 width=268) (actual time=0.007..27.085 rows=20000 loops=1)
  ->  Hash  (cost=349.97..349.97 rows=7497 width=4) (actual time=79.532..79.532 rows=7895 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 186kB
        ->  HashAggregate  (cost=275.00..349.97 rows=7497 width=4) (actual time=45.130..62.227 rows=7895 loops=1)
              ->  Seq Scan on orders  (cost=0.00..250.00 rows=10000 width=4) (actual time=0.008..20.979 rows=9999 loops=1)
                    Filter: (orderid < 10000)
Total runtime: 167.882 ms