PostgreSQL按组求和

7

在PostgreSQL中,我一直为这个问题而苦恼。我有一个名为test的表格,它有2列:- idcontent。比如:

create table test (id integer, 
                   content varchar(1024));

insert into test (id, content) values 
    (1, 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'),
    (2, 'Lorem Ipsum has been the industrys standard dummy text '),
    (3, 'ever since the 1500s, when an unknown printer took a galley of type and scrambled it to'),
    (4, 'make a type specimen book.'),
    (5, 'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.'),
    (6, 'It was popularised in the 1960s with the release of Letraset sheets containing Lorem '),
    (7, 'Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker'),
    (8, ' including versions of Lorem Ipsum.');

如果我运行以下查询...
select id, length(content) as characters from test order by id

...然后我得到:-

id | characters
---+-----------
 1 |         74
 2 |         55
 3 |         87
 4 |         26
 5 |        120
 6 |         85
 7 |         87
 8 |         35

我想要做的是将id分组成行,使得内容的总和超过一个阈值。例如,如果该阈值为100,则期望的结果如下:

-

ids | characters
----+-----------   
1,2 |        129
3,4 |        113    
5   |        120
6,7 |        172    
8   |         35 

注意(1): - 查询不需要生成一个characters列,只需要ids - 它们在这里用来传达它们都超过了100 - 除了最后一行的值为35

注意(2): - ids可以是逗号分隔的字符串或PostgreSQL数组 - 类型不如值重要。

我可以使用窗口函数来实现这个功能还是需要像lateral join这样更复杂的东西?


你的问题的答案是你需要使用更加复杂的东西: 递归CTE。性能可能不会特别好。 - Gordon Linoff
我已经接受了@Abelisto的答案,因为这是我在我的代码中使用的答案。 - bobmarksie
然而,@Gordon的回答让我印象深刻,仅仅通过尝试理解它,我就学到了很多。谢谢大家! - bobmarksie
3个回答

5
这种问题需要使用递归CTE(或类似的功能)。以下是一个示例:
with recursive t as (
      select id, length(content) as len,
             row_number() over (order by id) as seqnum
      from test 
     ),
     cte(id, len, ids, seqnum, grp) as (
      select id, len, len as cumelen, t.id::text, 1::int as seqnum, 1 as grp
      from t
      where seqnum = 1
      union all
      select t.id,
             t.len,
             (case when cte.cumelen >= 100 then t.len else cte.cumelen + t.len end) as cumelen,
             (case when cte.cumelen >= 100 then t.id::text else cte.ids || ',' || t.id::text end) as ids,
             t.seqnum
             (case when cte.cumelen >= 100 then cte.grp + 1 else cte.grp end) as ids,
      from t join
           cte
           on cte.seqnum = t.seqnum - 1
     )
select grp, max(ids)
from cte
group by grp;

这是一个小型的工作示例:
with recursive test as (
      select 1 as id, 'abcd'::text as content union all
      select 2 as id, 'abcd'::text as content union all
      select 3 as id, 'abcd'::text as content 
     ),
     t as (
      select id, length(content) as len,
             row_number() over (order by id) as seqnum
      from test 
     ),
     cte(id, len, cumelen, ids, seqnum, grp) as (
      select id, len, len as cumelen, t.id::text, 1::int as seqnum, 1 as grp
      from t
      where seqnum = 1
      union all
      select t.id,
             t.len,
             (case when cte.cumelen >= 5 then t.len else cte.cumelen + t.len end) as cumelen,
             (case when cte.cumelen >= 5 then t.id::text else cte.ids || ',' || t.id::text end) as ids,
             t.seqnum::int,
             (case when cte.cumelen >= 5 then cte.grp + 1 else cte.grp end)
      from t join
           cte
           on cte.seqnum = t.seqnum - 1
     )
select grp, max(ids)
from cte
group by grp;

1
你真是太快了... 我也想到了这个,正在 SQLfiddle 上编写一个递归 CTE。无人能敌的 Gordon :) - Thomas G
当我使用已创建的表的第一个查询时,我会得到以下错误信息:ERROR: recursive query "t" does not have the form non-recursive-term UNION [ALL] recursive-termSQL Fiddle - Clodoaldo Neto
@ClodoaldoNeto . . . 第二个查询在Postgres中肯定可以运行。我不知道为什么你会收到那个错误。 - Gordon Linoff
或者只需运行提供的创建/插入脚本。 - Clodoaldo Neto
你的第二个查询使用了 union all,而第一个没有使用,应该使用第一个。 - Clodoaldo Neto
显示剩余6条评论

2
使用存储函数可以避免(有时)令人头疼的查询。
create or replace function fn_foo(ids out int[], characters out int) returns setof record language plpgsql as $$
declare
  r record;
  threshold int := 100;
begin
  ids := '{}'; characters := 0;
  for r in (
    select id, coalesce(length(content),0) as lng
    from test order by id)
  loop
    characters := characters + r.lng;
    ids := ids || r.id;
    if characters > threshold then
      return next;
      ids := '{}'; characters := 0;
    end if;
  end loop;
  if ids <> '{}' then
    return next;
  end if;
end $$;

select * from fn_foo();

╔═══════╤════════════╗
║  ids  │ characters ║
╠═══════╪════════════╣
║ {1,2}129 ║
║ {3,4}113 ║
║ {5}120 ║
║ {6,7}172 ║
║ {8}35 ║
╚═══════╧════════════╝
(5 rows)

太好了,这个可行!由于它只是简单地循环,所以性能应该也相当不错。我还在开头添加了一个 threshold integer default 100, 参数,这样我就可以覆盖阈值了! - bobmarksie

1

我这里有一个使用LEAD()窗口函数的查询

SELECT id || ',' || next_id, characters + next_characters total_characters 
FROM  (SELECT id, characters, row_num, 
              CASE 
                WHEN row_num % 2 = 0 
                     AND characters < 100 THEN Lead(id) OVER(ORDER BY id) 
                ELSE NULL 
              END next_id, 
              CASE 
                WHEN row_num % 2 = 0 
                     AND characters < 100 THEN NULL 
                ELSE Lead(characters) OVER(ORDER BY id) 
              END AS next_characters 
       FROM  (SELECT id, 
                     Length(content)  AS characters, 
                     Row_number() 
                       OVER( 
                         ORDER BY id) row_num 
              FROM   test 
              ORDER  BY id)) 
WHERE  next_id IS NULL;

希望这能对你有所帮助。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接