无法使用UPDATE和子查询来使Postgresql的LIMIT起作用

Question

无法使用UPDATE和子查询来使Postgresql的LIMIT起作用

3

为什么以下查询返回的记录数超过了限制（limit）？实际上有大约272条记录存在，但是当提供一个小于该数的限制时，它总是更新并返回所有272条记录。

UPDATE thing 
SET status = 'status_2' 
WHERE thing.thing_id in (
  SELECT thing.thing_id 
  FROM thing 
  WHERE thing.status = ('status_1') AND 
  thing.c_id = ('c_id___1') AND 
  thing.name LIKE ('protocol://some/url/location/%')
  ORDER BY thing.status_1_date LIMIT (150) FOR UPDATE
) AND 
thing.c_id = ('c_id___1') AND 
thing.name LIKE ('protocol://some/url/location/%') 
RETURNING *

用另一种尝试也观察到相同的结果。

UPDATE thing 
SET status = 'status_2' 
FROM (
  SELECT thing.thing_id 
  FROM thing 
  WHERE thing.status = ('status_1') AND 
  thing.c_id = ('c_id___1') AND 
  thing.name LIKE ('protocol://some/url/location/%')
  ORDER BY thing.status_1_date LIMIT (150) FOR UPDATE
) AS temp 
WHERE thing.thing_id = temp.thing_id AND 
thing.c_id = ('c_id___1') AND 
thing.name LIKE ('protocol://some/url/location/%') 
RETURNING *

问题概述：我需要选择最多limit个thing_ids，并更新包含这个thing_ids子集的所有记录。对于我正在测试的这组记录，没有重复的thing_ids，但对于其他记录可能存在重复。

更新：为了证明没有重复，我运行了以下查询：

SELECT COUNT(thing.thing_id)
FROM thing
WHERE thing.c_id = ('c_id___1') AND 
thing.name LIKE ('protocol://some/url/location/%'))
GROUP BY thing.thing_id```

结果返回272。

更新：我能够得出一个可行的解决方案，但我将保持问题的开放，因为我想了解之前两个实现的问题在哪里。

新的实现：

UPDATE thing 
SET status = 'status_2' 
WHERE thing.thing_id in (
  SELECT a.thing_id 
  FROM thing a INNER 
  JOIN thing b on b.thing_id = a.thing_id 
  WHERE a.status = ('status_1') AND a.c_id = ('c_id___1') AND a.name LIKE ('protocol://some/url/location/%') 
  ORDER BY a.status_1_date, a.name LIMIT (150) FOR UPDATE
) RETURNING *

更新：事实证明，连接查询也不是我所需要的。如果唯一的thing.thing_ids的数量等于行数，它只会更新LIMIT行。如果有重复的thing_ids，它会排除thing_ids，以便更新的行数少于限制。

这是连接版本查询的查询计划：查询计划

Update on thing  (cost=31977.16..47319.15 rows=1 width=764)
"  ->  Hash Semi Join  (cost=31977.16..47319.15 rows=1 width=764)"
"        Hash Cond: ((thing.thing_id)::text = (""ANY_subquery"".thing_id)::text)"
"        ->  Seq Scan on thing  (cost=0.00..14776.23 rows=215523 width=190)"
"        ->  Hash  (cost=31977.15..31977.15 rows=1 width=92)"
"              ->  Subquery Scan on ""ANY_subquery""  (cost=31977.13..31977.15 rows=1 width=92)"
"                    ->  Limit  (cost=31977.13..31977.14 rows=1 width=134)"
"                          ->  LockRows  (cost=31977.13..31977.14 rows=1 width=134)"
"                                ->  Sort  (cost=31977.13..31977.13 rows=1 width=134)"
"                                      Sort Key: a.status_1_date, a.name"
"                                      ->  Hash Join  (cost=16392.67..31977.12 rows=1 width=134)"
"                                            Hash Cond: ((b.thing_id)::text = (a.thing_id)::text)"
"                                            ->  Seq Scan on thing b  (cost=0.00..14776.23 rows=215523 width=40)"
"                                            ->  Hash  (cost=16392.65..16392.65 rows=1 width=128)"
"                                                  ->  Seq Scan on thing a  (cost=0.00..16392.65 rows=1 width=128)"
"                                                        Filter: (((name)::text ~~ 'protocol://path/to/thing/%'::text) AND ((c_id)::text = 'c_id_1'::text) AND ((status)::text = 'status_2'::text))"

创建表的查询：

CREATE TABLE IF NOT EXISTS thing (
  name VARCHAR(255) PRIMARY KEY,
  thing_id VARCHAR(255),
  c_id VARCHAR(255),
  status VARCHAR(255),
  etag VARCHAR(255),
  last_modified VARCHAR(255),
  size VARCHAR(255),
  status_1_date DATE
);

插入示例数据的查询语句：

INSERT INTO thing (name,thing_id, c_id,status, etag, last_modified, size, status_1_date)
values 
('protocol://full/path/to/thing/thing_1.file1', 'thing_1','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:01'),
('protocol://full/path/to/thing/thing_1.file2', 'thing_1','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:02'),
('protocol://full/path/to/thing/thing_2.file1', 'thing_2','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:03'),
('protocol://full/path/to/thing/thing_2.file2', 'thing_2','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:04'),
('protocol://full/path/to/thing/thing_2.file3', 'thing_2','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_3.file1', 'thing_3','c_id', 'status_1', 'etag', '1111', 200, '2023-09-29 09:00:06');

使用LIMIT 2限制返回结果，期望的结果如下：
查询1：

('protocol://full/path/to/thing/thing_1.file1', 'thing_1','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:01'),
('protocol://full/path/to/thing/thing_1.file2', 'thing_1','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:02'),
('protocol://full/path/to/thing/thing_2.file1', 'thing_2','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:03'),
('protocol://full/path/to/thing/thing_2.file2', 'thing_2','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:04'),
('protocol://full/path/to/thing/thing_2.file3', 'thing_2','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:05'),

查询2：

('protocol://full/path/to/thing/thing_3.file1', 'thing_3','c_id', 'status_2', 'etag', '1111', 200, '2023-09-29 09:00:06')

- Cogito Ergo Sum

2

你确定 thing.thing_id 是唯一的吗？如果有重复的话，那可能就能解释这个问题了。
你尝试过运行 select 子查询并验证其结果吗？

- undefined

@Schwern，我已经更新了问题，并提供了一个查询来代表我要解决的情况。如果有任何不清楚的地方，请告诉我。 - undefined

1

@CogitoErgoSum 所以 thing.thing_id 不是唯一的。这很可能会导致你的问题。让我试试一些方法。你能提供一个模式吗？ - undefined

1

在这个行为的帖子中找到了相关讨论。 - undefined

正在努力获取模式。我无法转储我实际工作的数据库的模式。URL是匹配的，只是我自己打错了一个字。 - undefined

显示剩余8条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Schwern · Accepted Answer

这是一个子查询中的select for update。pgsql-bugs上的一个帖子解释了这个问题...

问题在于子查询原则上对外部查询的每一行都会重新执行一次，并且你无法保证每次执行都返回相同的单行结果。普通的查询执行会提供这样的保证，但是你正在对一个正在被并发修改的表使用FOR UPDATE，包括这个查询本身，因此这个保证就消失了。

你可能会认为，由于子查询与外部查询无关，所以没有必要执行多次...但这是一种优化，不是保证的一部分。如果我们无法在你的系统上看到这个查询的EXPLAIN结果，我们就无法知道是否是这样执行的（尽管你的投诉表明可能不是这样）。

而且我们可以从explain中看到，这个子查询是在一个嵌套循环中。

QUERY PLAN
Update on thing  (cost=16.14..31.42 rows=1 width=94)
  ->  Nested Loop Semi Join  (cost=16.14..31.42 rows=1 width=94)
        Join Filter: (thing.thing_id = "ANY_subquery".thing_id)
        ->  Seq Scan on thing  (cost=0.00..15.25 rows=1 width=38)
              Filter: ((name ~~ 'protocol://full/path/to/thing/%'::text) AND (c_id = 'c_id'::text))
        ->  Subquery Scan on "ANY_subquery"  (cost=16.14..16.16 rows=1 width=88)
              ->  Limit  (cost=16.14..16.15 rows=1 width=46)
                    ->  LockRows  (cost=16.14..16.15 rows=1 width=46)
                          ->  Sort  (cost=16.14..16.14 rows=1 width=46)
                                Sort Key: thing_1.status_1_date
                                ->  Seq Scan on thing thing_1  (cost=0.00..16.12 rows=1 width=46)
                                      Filter: ((name ~~ 'protocol://full/path/to/thing/%'::text) AND (status = 'status_1'::text) AND (c_id = 'c_id'::text))

在帖子的后面有一个更详细的解释。

seqscan+sort将查找并返回满足“过滤”条件的所有行，查询开始时的状态。LockRows节点将获取这些行中的第一行并锁定它，这将包括查找和锁定由于并发更新而存在的任何更新版本的行。如果存在更新版本，则它会重新检查该版本是否仍满足过滤条件（通过一些我们不必在这里详细讨论的魔法）。如果满足条件，它将将该行返回给LIMIT节点，LIMIT节点将返回该行并声明查询完成，因此我们找到并锁定了一行。然而，如果第一行已更新为不满足过滤条件的状态，LockRows节点将前进到seqscan+sort输出的下一行，并锁定并重新测试该行。这个过程会重复，直到找到一行在锁定后仍然满足过滤条件。

因此，很容易看出并发更新可能会导致此查询锁定N行，其中N大于1。但仅此而已并不能很好地解释查询锁定所有行的情况，正如您所述。所有行都必须同时更新为不再满足过滤条件的状态，这似乎不太可能发生，因为一些其他事务分别执行相同类型的查询。

他们的建议是用CTE替换它。这通常是一个很好的建议，使用CTE而不是子查询FROM，即使只是因为它更容易阅读。

注意，如果给thing一个主键，你的查询将更快更简单。这样你就可以识别每一行，而不需要重复使用where子句。演示。

with these_things as (
  SELECT thing.id 
  FROM thing 
  WHERE thing.status = ('status_1') AND 
  thing.c_id = ('c_id') AND 
  thing.name LIKE ('protocol://full/path/to/thing/%')
  ORDER BY thing.status_1_date LIMIT (2)
)
UPDATE thing 
SET status = 'status_2'
from these_things
where thing.id = these_things.id
RETURNING *