PostgreSQL中材料化视图的替代方案

Question

PostgreSQL中材料化视图的替代方案

3

我有一个包含三列的表：creationTime、number、id。这个表每隔15秒就会被填充一次。我一直在使用materialized view来跟踪重复项，如下所示：

 SELECT number, id, count(*) AS dups_count
 FROM my_table
 GROUP BY number, id HAVING count(*) > 1;

表中包含了1.5年的数千条记录。目前刷新这个物化视图需要大约2分钟的时间。我希望有一个更好的解决方案。PostgreSQL没有快速刷新物化视图的功能。

起初，我认为为表创建触发器以刷新物化视图可能是一个解决方案。但如果每15秒就有记录进来，并且物化视图需要超过2分钟才能重新计算，那么这不是一个好主意。无论如何，我不喜欢反复重算相同的数据这个想法。

有更好的解决方案吗？

- PSS

一旦您识别出这些重复项，您会怎么处理它们？ - Blue Star

我用不同的查询查找它们。在不同的数据集中，重复创建表明了其他一些事情。我想始终记录所有重复项。 - PSS

你是否曾经从那个表中删除过行？ - user330315

3个回答

1

每当存在成长的机会时，扩展的最佳方式是找到一种在增量数据上重复流程的方法。为了解释这一点，我们将被提到的表命名为“Tab”。

Tab  
Number  ID  CreationTime
Index on creationtime column.

应用增量方法的关键是拥有一个单调递增的值。这里我们用'creationtime'来表示。 (a) 创建另一张表Tab_duplicate，其中包含一个额外的列'last_compute_timestamp'。

Tab_duplicate
Number ID Duplicate_count last_compute_timestamp

(b) 在“last_compute_timestamp”列上创建索引。

(c) 运行插入语句以查找重复记录并将其插入Tab_duplicate表中，并附带last_compute_timestamp。

(d) 对于重复执行：

安装pg_cron扩展（如果不存在），并自动执行此插入操作。https://github.com/citusdata/pg_cron

https://fatdba.com/2021/07/30/pg_cron-probably-the-best-way-to-schedule-jobs-within-postgresql-database/

或者 2. 使用shell脚本/Python脚本通过操作系统crontab在数据库上执行它。

由于last_compute_timestamp在每次迭代中都被记录并在下一次重复使用，因此它将是增量的并且始终很快。

演示：

步骤1：生产表

create table tab
(
 id int,
 number int,
 creationtime timestamp
 );
create index tab_id on tab(creationtime);

步骤2：复制捕获表，包括一个首次启动记录（第一次执行后可以删除）

create table tab_duplicate
(
 id int,
 number int,
 duplicate_count int,
 last_compute_timestamp timestamp);
 create index tab_duplicate_idx on tab_duplicate(last_compute_timestamp);
 insert into tab_duplicate values(0,0,0,current_timestamp);

步骤三: 将一些重复的条目输入到生产表中

insert into tab values(1,10,current_timestamp);
 select pg_sleep(1);
 insert into tab values(1,10,current_timestamp);
 insert into tab values(1,10,current_timestamp);
 select pg_sleep(1);
 insert into tab values(2,20,current_timestamp);
 select pg_sleep(1);
 insert into tab values(2,20,current_timestamp);
 select pg_sleep(1);
 insert into tab values(3,30,current_timestamp);
 insert into tab values(3,30,current_timestamp);
 select pg_sleep(1);
 insert into tab values(4,40,current_timestamp);

验证记录：

postgres=# select * from tab;
 id | number |        creationtime
----+--------+----------------------------
  1 |     10 | 2022-01-23 19:00:37.238865
  1 |     10 | 2022-01-23 19:00:38.248574
  1 |     10 | 2022-01-23 19:00:38.252622
  2 |     20 | 2022-01-23 19:00:39.259584
  2 |     20 | 2022-01-23 19:00:40.26655
  3 |     30 | 2022-01-23 19:00:41.274673
  3 |     30 | 2022-01-23 19:00:41.279298
  4 |     40 | 2022-01-23 19:00:52.697257
(8 rows)

步骤4: 重复项已被捕获并验证。

INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM   (SELECT id,
               number,
               Count(*) duplicate_count
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct
        GROUP  BY id,
                  number) a,
       (SELECT Max(creationtime) last_compute_timestamp
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct) b;

执行：

postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-#        a.number,
postgres-#        a.duplicate_count,
postgres-#        b.last_compute_timestamp
postgres-# FROM   (SELECT id,
postgres(#                number,
postgres(#                Count(*) duplicate_count
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct
postgres(#         GROUP  BY id,
postgres(#                   number) a,
postgres-#        (SELECT Max(creationtime) last_compute_timestamp
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct) b;
INSERT 0 4
postgres=#

验证：

postgres=# select * from tab_duplicate;
 id | number | duplicate_count |   last_compute_timestamp
----+--------+-----------------+----------------------------
  0 |      0 |               0 | 2022-01-23 19:00:25.779671
  3 |     30 |               2 | 2022-01-23 19:00:52.697257
  1 |     10 |               3 | 2022-01-23 19:00:52.697257
  4 |     40 |               1 | 2022-01-23 19:00:52.697257
  2 |     20 |               2 | 2022-01-23 19:00:52.697257
(5 rows)

步骤5: 将更多的重复项插入到生产表中

insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(5,50,current_timestamp);
 select pg_sleep(1);
 insert into tab values(6,60,current_timestamp);
 select pg_sleep(1);
 insert into tab values(6,60,current_timestamp);
 select pg_sleep(1);

步骤6: 执行相同的重复捕获SQL将仅捕获生产表中的增量记录。


INSERT INTO tab_duplicate
SELECT a.id,
       a.number,
       a.duplicate_count,
       b.last_compute_timestamp
FROM   (SELECT id,
               number,
               Count(*) duplicate_count
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct
        GROUP  BY id,
                  number) a,
       (SELECT Max(creationtime) last_compute_timestamp
        FROM   tab,
               (SELECT Max(last_compute_timestamp) lct
                FROM   tab_duplicate) max_date
        WHERE  creationtime > max_date.lct) b;

执行：

postgres=# INSERT INTO tab_duplicate
postgres-# SELECT a.id,
postgres-#        a.number,
postgres-#        a.duplicate_count,
postgres-#        b.last_compute_timestamp
postgres-# FROM   (SELECT id,
postgres(#                number,
postgres(#                Count(*) duplicate_count
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct
postgres(#         GROUP  BY id,
postgres(#                   number) a,
postgres-#        (SELECT Max(creationtime) last_compute_timestamp
postgres(#         FROM   tab,
postgres(#                (SELECT Max(last_compute_timestamp) lct
postgres(#                 FROM   tab_duplicate) max_date
postgres(#         WHERE  creationtime > max_date.lct) b;
INSERT 0 2

验证：

postgres=# select * from tab_duplicate;
 id | number | duplicate_count |   last_compute_timestamp
----+--------+-----------------+----------------------------
  0 |      0 |               0 | 2022-01-23 19:00:25.779671
  3 |     30 |               2 | 2022-01-23 19:00:52.697257
  1 |     10 |               3 | 2022-01-23 19:00:52.697257
  4 |     40 |               1 | 2022-01-23 19:00:52.697257
  2 |     20 |               2 | 2022-01-23 19:00:52.697257
  5 |     50 |               3 | 2022-01-23 19:02:37.884417
  6 |     60 |               2 | 2022-01-23 19:02:37.884417
(7 rows)

这个重复捕获总是很快的，因为有两个原因：

它仅在您计划的最后一段时间的增量数据上工作。
扫描表以查找最大时间戳只发生在单列索引上（仅索引扫描）。

从执行计划中：

->  Index Only Scan Backward using tab_duplicate_idx on tab_duplicate tab_duplicate_2  (cost=0.15..77.76 rows=1692 width=8)

注意：如果您在表格tab_duplicate中有长时间的重复记录，您可以在定期时间内将记录去重，例如在一天结束时进行，这样会很快，因为TAB_DUPLICATE是一个聚合小表，而且该表对您的应用程序是离线的，而TAB是您的生产表，具有大量累积数据。

此外，在生产表上设置触发器也是可行的解决方案，但这会增加生产事务的开销，因为每次插入都需要执行触发器。

- rajorshi

1

有两种方法可以考虑：

1.创建一个带有（number，id）列的辅助表。添加触发器，以便每当要插入重复行到my_table时，它也会被插入到这个辅助表中。这样，你就可以在辅助表中获得所需的数据，并且除非你有大量这些重复项，否则不会占用太多空间。

2.向my_table添加一个新列，例如时间戳，以区分重复项。在（number，id）列上为my_table添加唯一约束条件，其中新列为空。然后，您可以更改插入操作，包括ON CONFLICT子句，以便如果要插入重复项，则将其时间戳设置为当前时间。当您想要搜索重复项时，您可以使用新列进行查询。

- Blue Star

谢谢！第二个解决方案对我不适用，因为我无法访问记录的插入。它是由第三方软件插入的。第一个解决方案确实很有道理。我找不到有关创建检查每个插入的特定触发器的任何信息。顺便说一下，我正在寻找双重重复，这意味着“编号和ID”的组合必须相同。谢谢！ - PSS

你需要一个BEFORE INSERT触发器，它查询my_table以查看是否已经存在具有相同编号和ID的行（如果是，则插入到辅助表中）。确保在my_table的这两列上有索引，否则触发器将会非常慢。 - Blue Star

我已经在它们两个上都有索引了。我会尝试并回报。谢谢! - PSS

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- user330315 · Accepted Answer

触发器增加重复计数可能是一个解决方案：

create table duplicates
(
  number int, 
  id int, 
  dups_count int,
  primary key (number, id)
);

主键将允许有效的“UPSERT”，在重复情况下增加“dups_count”计数。

然后创建一个触发器，在每次向基础表中插入一行时更新该表：

create function increment_dupes()
  returns trigger
as
$$
begin
   insert into duplicates (number, id, dups_count)
   values (new.number, new.id, 1)
   on conflict (number,id)
   do update 
     set dups_count = duplicates.dups_count + 1;
   return new;
end
$$
language plpgsql;

create trigger update_dups_count
  after insert on my_table
  for each row
  execute function increment_dupes();

每次你向my_table插入数据，都会在duplicates中创建一个新行，或者当前的dups_count将被增加。如果你从my_table中删除或更新行，你也需要一个触发器来完成。然而，对于并发操作，为UPDATE或DELETE更新计数并不完全安全。但使用INSERT ON CONFLICT则可以解决这个问题。

触发器确实会带来一些性能开销，因此你需要测试一下是否超出了你的要求。