PostgreSQL: which is faster, DISTINCT or GROUP BY?

Question

4

There are two ways to do the same thing.

For example: getting the distinct names from a database of persons.

The first way is:

SELECT name 
FROM person 
GROUP BY name

This gives the same result as:

SELECT DISTINCT name 
FROM person

I'm curious whether the PostgreSQL engine processes these two commands differently, and which one is faster, or whether they do exactly the same thing.

- Герман Ганыс

1

DISTINCT is better; GROUP BY is meant for SUM / AVERAGE / or other per-group calculations. - Josua Marcel C

4

In theory they should be the same, but GROUP BY can use parallel query and DISTINCT cannot. So in some cases GROUP BY may be faster. - user330315

2

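There is also a third way: SELECT name FROM person UNION SELECT name FROM person. But I would go with SELECT DISTINCT. - jarlh

Why does it have to be fast? This is a trivial question. (In a non-trivial query, "DISTINCT" would raise a red flag.) - wildplasser

The answer to this question necessarily depends on the data and the environment the query runs in. I suggest you run the benchmarks yourself. - Bob Jarvis - Слава Україні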

1 Answer

- LukStorms · Accepted Answer

For a small number of records (e.g. 100,000) it doesn't really matter: both will use the same HashAggregate method.
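A quick way to check this yourself is to compare the two plans on a small table. The following is only a sketch: the table persons_small and its generated names are hypothetical, made up for illustration, and the plan you get depends on your PostgreSQL version and settings.

```sql
-- Hypothetical small table, for illustration only (not part of the demo below)
CREATE TABLE persons_small AS
SELECT 'name_' || (g % 1000) AS name
FROM generate_series(1, 100000) AS g;

-- Collect statistics so the planner's row estimates are realistic
ANALYZE persons_small;

-- Compare the top plan node of the two queries
EXPLAIN SELECT DISTINCT name FROM persons_small;
EXPLAIN SELECT name FROM persons_small GROUP BY name;

-- Clean up
DROP TABLE persons_small;
```

If both plans start with the same aggregate node, the two queries are being executed the same way on that data.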

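Code golfers will then prefer DISTINCT, since its syntax is slightly shorter. GROUP BY is the better fit when it is combined with aggregate functions such as MAX, SUM, COUNT, AVG, etc.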
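As a small illustration of that second case, here is a sketch using the person table from the question. The name_count alias is an assumption added for readability. A plain SELECT DISTINCT name cannot return a per-name count like this.

```sql
-- How many rows share each name: this needs GROUP BY plus an aggregate
SELECT name,
       COUNT(*) AS name_count   -- name_count is an illustrative alias
FROM person
GROUP BY name
ORDER BY name_count DESC;
```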

But for larger record sets there is a difference.

For example, in this demo:

create table Persons (
 Name varchar(30)
);

INSERT INTO Persons (Name)
SELECT
    arrays.firstnames[s.a % ARRAY_LENGTH(arrays.firstnames,1) + 1]
 || arrays.lastnames[s.a % ARRAY_LENGTH(arrays.lastnames,1) + 1] AS name
FROM     generate_series(1,600000) AS s(a) -- number of names to generate
CROSS JOIN(
    SELECT ARRAY[
    'Adam','Bill','Bob','Calvin','Donald','Dwight','Frank','Fred','George','Howard',
    'James','John','Jacob','Jack','Martin','Matthew','Max','Michael','Lukas', 
    'Paul','Peter','Phil','Roland','Ronald','Samuel','Steve','Theo','Warren','William',
    'Abigail','Alice','Allison','Amanda','Anne','Barbara','Betty','Carol','Cleo','Donna',
    'Jane','Jennifer','Julie','Martha','Mary','Melissa','Patty','Sarah','Simone','Susan'
    ] AS firstnames,
    ARRAY[
        'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Storms','Stevens',
        'Franklin','Washington','Jefferson','Adams','Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft',
        'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
    ] AS lastnames
) AS arrays;

select count(*) from Persons;

|  count |
| -----: |
| 600000 |

explain analyse
select distinct Name from Persons;

| QUERY PLAN                                                                                                           |
| :------------------------------------------------------------------------------------------------------------------- |
| HashAggregate  (cost=6393.82..6395.82 rows=200 width=78) (actual time=194.609..194.757 rows=1470 loops=1)            |
|   Group Key: name                                                                                                    |
|   ->  Seq Scan on persons  (cost=0.00..5766.66 rows=250866 width=78) (actual time=0.030..61.243 rows=600000 loops=1) |
| Planning time: 0.259 ms                                                                                              |
| Execution time: 194.898 ms                                                                                           |

explain analyse
select Name from Persons group by Name;

| QUERY PLAN                                                                                                                                      |
| :---------------------------------------------------------------------------------------------------------------------------------------------- |
| Group  (cost=5623.88..5625.88 rows=200 width=78) (actual time=226.358..227.145 rows=1470 loops=1)                                               |
|   Group Key: name                                                                                                                               |
|   ->  Sort  (cost=5623.88..5624.88 rows=400 width=78) (actual time=226.356..226.596 rows=4410 loops=1)                                          |
|         Sort Key: name                                                                                                                          |
|         Sort Method: quicksort  Memory: 403kB                                                                                                   |
|         ->  Gather  (cost=5564.59..5606.59 rows=400 width=78) (actual time=206.700..219.546 rows=4410 loops=1)                                  |
|               Workers Planned: 2                                                                                                                |
|               Workers Launched: 2                                                                                                               |
|               ->  Partial HashAggregate  (cost=4564.59..4566.59 rows=200 width=78) (actual time=196.862..197.072 rows=1470 loops=3)             |
|                     Group Key: name                                                                                                             |
|                     ->  Parallel Seq Scan on persons  (cost=0.00..4303.27 rows=104528 width=78) (actual time=0.039..66.876 rows=200000 loops=3) |
| Planning time: 0.069 ms                                                                                                                         |
| Execution time: 227.301 ms                                                                                                                      |

db<>fiddle here

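So in this example DISTINCT is still faster (about 195 ms versus 227 ms execution time).
But since the GROUP BY starts to work in parallel, this may also depend on the server that hosts PostgreSQL.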
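To see how much of the difference comes from parallelism on your own server, one option is to rerun both queries with parallel workers disabled for the session. This is a sketch under assumptions, not part of the original demo. It uses the standard max_parallel_workers_per_gather setting, and it first runs ANALYZE, since the row estimates in the plans above (250866 estimated versus 600000 actual) suggest the table statistics were not current.

```sql
-- Refresh planner statistics so the estimates reflect the 600000 rows
ANALYZE Persons;

-- Disable parallel workers for this session only
SET max_parallel_workers_per_gather = 0;

EXPLAIN ANALYZE SELECT DISTINCT Name FROM Persons;
EXPLAIN ANALYZE SELECT Name FROM Persons GROUP BY Name;

-- Go back to the server's configured value
RESET max_parallel_workers_per_gather;
```

Running the same pair again after the RESET shows whether a parallel plan helps or hurts on that machine. Timings vary between runs, so it is worth repeating each query a few times.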