使用Tablefunc在多列上进行数据透视

22
有人使用过tablefunc来对多个变量进行数据透视吗?而不仅仅是使用行名
我需要对数十亿行数据进行这样的操作,并且我正在测试将这些数据存储在长格式和宽格式中,看看是否可以使用tablefunc比常规聚合函数更高效地从长格式转换为宽格式。我每分钟会有大约100个实体进行约300次测量。我们经常需要比较给定实体在给定秒钟内进行的不同测量,因此我们经常需要转换为宽格式。此外,对于特定实体进行的测量具有高度的变异性。
作为示例数据,我已经编辑了在这个问题的回答中使用的数据。
 CREATE TEMP TABLE t4 (
  timeof   timestamp
 ,entity    character
 ,status    integer
 ,ct        integer);

 INSERT INTO t4 VALUES 
  ('2012-01-01', 'a', 1, 1)
 ,('2012-01-01', 'a', 0, 2)
 ,('2012-01-02', 'b', 1, 3)
 ,('2012-01-02', 'c', 0, 4);

 SELECT * FROM crosstab(
     'SELECT timeof, entity, status, ct
      FROM   t4
      ORDER  BY 1,2,3'
     ,$$VALUES (1::text), (0::text)$$)
 AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);

返回:

部分 属性 1 0
2012-01-01 00:00:00 a 1 2
2012-01-02 00:00:00 b 3 4

根据文档所述,额外列,也就是'属性',被假定为每个行名称,也就是'部分',都是相同的。因此,即使'实体'在该'timeof'值上也有一个'c'值,它仍然为第二行报告了b

期望输出:

部分 属性 1 0
2012-01-01 00:00:00 a 1 2
2012-01-02 00:00:00 b 3
2012-01-02 00:00:00 c 4

我使用了一些资源:12

我该怎么做?


很好的问题,附有(非常感谢)可工作的测试用例,清晰地展示了问题。只缺少PostgreSQL版本。我假设当前版本为9.2。但是解决方案对于最近几个主要版本都是相同的。 - Erwin Brandstetter
该解决方案仅适用于此问题,我指的是仅限于“此问题”,因为它不适用于具有两个列索引的数据透视表。因此,标题不正确,或者答案应该是第二个(id = answer-15559942)。 - gavioto
3个回答

19
您的查询问题在于bc共享相同的时间戳2012-01-02 00:00:00,并且您在查询中首先使用了timestamptimeof,因此即使您添加了粗体强调,bc仍然只是落在相同组2012-01-02 00:00:00中的额外列。只有第一个(b)被返回,因为(引用手册)

row_name列必须放在第一位。 categoryvalue列必须是最后两列,按照这个顺序。在row_namecategory之间的任何列都被视为“额外”列。假定“额外”列对于具有相同row_name值的所有行都是相同的。

加粗是我的。
只需将第一列和第二列的顺序颠倒,使entity成为行名称,就可以按预期工作:
SELECT *
FROM   crosstab(
   'SELECT entity, timeof, status, ct
    FROM   t4
    ORDER  BY 1'
 , 'VALUES (1), (0)'
   ) AS ct (
      "Attribute" character
    , "Section" timestamp
    , "status_1" int
    , "status_0" int
      );

entity 必须是唯一的。

重申

  • row_name 首先
  • (可选) extra接下来
  • category (由第二个参数定义) 和 value 最后

额外的列从每个 row_name 分区的 第一行 填充。来自其他行的值将被忽略,每个 row_name 只有一列要填充。通常情况下,对于一个 row_name 的每一行,这些值都应该是相同的,但这取决于您。

基础知识:

针对不同的设置 在您的答案中

SELECT localt, entity
     , msrmnt01, msrmnt02, msrmnt03, msrmnt04, msrmnt05  -- , more?
FROM   crosstab(
  'SELECT dense_rank() OVER (ORDER BY localt, entity)::int AS row_name
        , localt, entity -- additional columns
        , msrmnt, val
   FROM   test
-- WHERE  ???   -- instead of LIMIT at the end
   ORDER  BY localt, entity, msrmnt
-- LIMIT ???'   -- instead of LIMIT at the end
, 'SELECT generate_series(1,5)'  -- more?
   ) AS ct (row_name int, localt timestamp, entity int
          , msrmnt01 float8, msrmnt02 float8, msrmnt03 float8, msrmnt04 float8, msrmnt05 float8 -- , more?
            )
LIMIT 1000  -- ?!

难怪你的测试查询表现糟糕。你的测试设置有14M行,而且在使用LIMIT 1000抛弃大部分数据之前,你处理了所有的数据。如果要减少结果集,请在源查询中添加WHERE条件或LIMIT

此外,你使用的数组也是不必要的昂贵开销。我用dense_rank()生成了一个替代行名。

db<>fiddle 这里 - 使用更简单的测试设置和更少的行。


嗨,Erwin,非常感谢您的回复。我会授予您信用,但我认为我的询问中存在一些歧义。我有时间时会再次发布。 - ideamotor
1
@AndreSilva:我相信一定有方法。请用“问题”提出您的问题。评论不是提问的地方。您可以随时链接到此处以获取上下文并在此处放置一个评论链接以引起我的注意。 - Erwin Brandstetter

13
在我的原问题中,我应该使用这个作为我的样本数据:
CREATE TEMP TABLE t4 (
 timeof    date
,entity    integer
,status    integer
,ct        integer);
INSERT INTO t4 VALUES 
 ('2012-01-01', 1, 1, 1)
,('2012-01-01', 1, 0, 2)
,('2012-01-01', 3, 0, 3)
,('2012-01-02', 2, 1, 4)
,('2012-01-02', 3, 1, 5)
,('2012-01-02', 3, 0, 6);

我必须在时间和实体两个维度上进行旋转。由于 tablefunc 只使用一个列来进行旋转,因此您需要找到一种将两个维度都放入该列的方法。(http://www.postgresonline.com/journal/categories/24-tablefunc)。我选择了数组,就像链接中的示例。

SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date 
           as localt, 
           row_name[2] As entity, status1, status0
FROM crosstab('SELECT ARRAY[extract(epoch from timeof), entity] as row_name,
                    status, ct
               FROM t4 
               ORDER BY timeof, entity, status'
     ,$$VALUES (1::text), (0::text)$$) 
          as ct (row_name integer[], status1 int, status0 int)

就我个人而言,我尝试使用字符数组,到目前为止看起来对我的设置来说更快;9.2.3 PostgreSQL。

这是结果和期望的输出。

localt           | entity | status1 | status0
--------------------------+---------+--------
2012-01-01       |   1    |    1    |   2
2012-01-01       |   3    |         |   3
2012-01-02       |   2    |    4    |  
2012-01-02       |   3    |    5    |   6

我很好奇这个在更大的数据集上的表现如何,等待后续报告。


非常有趣。我认为这应该被标记为有效答案。第一个现在https://dev59.com/VWUp5IYBdhLWcg3wPFtK不回答问题,因为同时具有值的实体将被忽略。 - gavioto

2

好的,我在一个更接近我的用例的表格上运行了这个操作。要么我做错了,要么crosstab不适合我的情况。

首先,我创建了一些类似的数据:

CREATE TABLE public.test (
    id serial primary key,
    msrmnt integer,
    entity integer,
    localt timestamp,
    val    double precision
);
CREATE INDEX ix_test_msrmnt
   ON public.test (msrmnt);
 CREATE INDEX ix_public_test_201201_entity
   ON public.test (entity);
CREATE INDEX ix_public_test_201201_localt
  ON public.test (localt);
insert into public.test (msrmnt, entity, localt, val)
select *
from(
SELECT msrmnt, entity, localt, random() as val 
FROM generate_series('2012-01-01'::timestamp, '2012-01-01 23:59:00'::timestamp, interval '1 minutes') as localt
join 
(select *
FROM generate_series(1, 50, 1) as msrmnt) as msrmnt
on 1=1
join 
(select *
FROM generate_series(1, 200, 1) as entity) as entity
on 1=1) as data;

然后我运行了交叉表代码几次:

explain analyze
SELECT (timestamp 'epoch' + row_name[1] * INTERVAL '1 second')::date As localt, row_name[2] as entity
    ,msrmnt01,msrmnt02,msrmnt03,msrmnt04,msrmnt05,msrmnt06,msrmnt07,msrmnt08,msrmnt09,msrmnt10
    ,msrmnt11,msrmnt12,msrmnt13,msrmnt14,msrmnt15,msrmnt16,msrmnt17,msrmnt18,msrmnt19,msrmnt20
    ,msrmnt21,msrmnt22,msrmnt23,msrmnt24,msrmnt25,msrmnt26,msrmnt27,msrmnt28,msrmnt29,msrmnt30
    ,msrmnt31,msrmnt32,msrmnt33,msrmnt34,msrmnt35,msrmnt36,msrmnt37,msrmnt38,msrmnt39,msrmnt40
    ,msrmnt41,msrmnt42,msrmnt43,msrmnt44,msrmnt45,msrmnt46,msrmnt47,msrmnt48,msrmnt49,msrmnt50
    FROM crosstab('SELECT ARRAY[extract(epoch from localt), entity] as row_name, msrmnt, val
               FROM public.test
               ORDER BY localt, entity, msrmnt',$$VALUES  ( 1::text),( 2::text),( 3::text),( 4::text),( 5::text),( 6::text),( 7::text),( 8::text),( 9::text),(10::text)
                                                         ,(11::text),(12::text),(13::text),(14::text),(15::text),(16::text),(17::text),(18::text),(19::text),(20::text)
                                                         ,(21::text),(22::text),(23::text),(24::text),(25::text),(26::text),(27::text),(28::text),(29::text),(30::text)
                                                         ,(31::text),(32::text),(33::text),(34::text),(35::text),(36::text),(37::text),(38::text),(39::text),(40::text)
                                                         ,(41::text),(42::text),(43::text),(44::text),(45::text),(46::text),(47::text),(48::text),(49::text),(50::text)$$)
        as ct (row_name integer[],msrmnt01 double precision, msrmnt02 double precision,msrmnt03 double precision, msrmnt04 double precision,msrmnt05 double precision, 
                    msrmnt06 double precision,msrmnt07 double precision, msrmnt08 double precision,msrmnt09 double precision, msrmnt10 double precision
                 ,msrmnt11 double precision, msrmnt12 double precision,msrmnt13 double precision, msrmnt14 double precision,msrmnt15 double precision, 
                    msrmnt16 double precision,msrmnt17 double precision, msrmnt18 double precision,msrmnt19 double precision, msrmnt20 double precision
                 ,msrmnt21 double precision, msrmnt22 double precision,msrmnt23 double precision, msrmnt24 double precision,msrmnt25 double precision, 
                    msrmnt26 double precision,msrmnt27 double precision, msrmnt28 double precision,msrmnt29 double precision, msrmnt30 double precision
                 ,msrmnt31 double precision, msrmnt32 double precision,msrmnt33 double precision, msrmnt34 double precision,msrmnt35 double precision, 
                    msrmnt36 double precision,msrmnt37 double precision, msrmnt38 double precision,msrmnt39 double precision, msrmnt40 double precision
                 ,msrmnt41 double precision, msrmnt42 double precision,msrmnt43 double precision, msrmnt44 double precision,msrmnt45 double precision, 
                    msrmnt46 double precision,msrmnt47 double precision, msrmnt48 double precision,msrmnt49 double precision, msrmnt50 double precision)
limit 1000

第三次尝试获得这个结果:
QUERY PLAN
Limit  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.673..110237.667 rows=1000 loops=1)
  ->  Function Scan on crosstab ct  (cost=0.00..20.00 rows=1000 width=432) (actual time=110236.672..110237.598 rows=1000 loops=1)
Total runtime: 110699.598 ms

然后我运行了标准解决方案几次:

explain analyze
select localt, entity, 
 max(case when msrmnt =  1 then val else null end) as msrmnt01
,max(case when msrmnt =  2 then val else null end) as msrmnt02
,max(case when msrmnt =  3 then val else null end) as msrmnt03
,max(case when msrmnt =  4 then val else null end) as msrmnt04
,max(case when msrmnt =  5 then val else null end) as msrmnt05
,max(case when msrmnt =  6 then val else null end) as msrmnt06
,max(case when msrmnt =  7 then val else null end) as msrmnt07
,max(case when msrmnt =  8 then val else null end) as msrmnt08
,max(case when msrmnt =  9 then val else null end) as msrmnt09
,max(case when msrmnt = 10 then val else null end) as msrmnt10
,max(case when msrmnt = 11 then val else null end) as msrmnt11
,max(case when msrmnt = 12 then val else null end) as msrmnt12
,max(case when msrmnt = 13 then val else null end) as msrmnt13
,max(case when msrmnt = 14 then val else null end) as msrmnt14
,max(case when msrmnt = 15 then val else null end) as msrmnt15
,max(case when msrmnt = 16 then val else null end) as msrmnt16
,max(case when msrmnt = 17 then val else null end) as msrmnt17
,max(case when msrmnt = 18 then val else null end) as msrmnt18
,max(case when msrmnt = 19 then val else null end) as msrmnt19
,max(case when msrmnt = 20 then val else null end) as msrmnt20
,max(case when msrmnt = 21 then val else null end) as msrmnt21
,max(case when msrmnt = 22 then val else null end) as msrmnt22
,max(case when msrmnt = 23 then val else null end) as msrmnt23
,max(case when msrmnt = 24 then val else null end) as msrmnt24
,max(case when msrmnt = 25 then val else null end) as msrmnt25
,max(case when msrmnt = 26 then val else null end) as msrmnt26
,max(case when msrmnt = 27 then val else null end) as msrmnt27
,max(case when msrmnt = 28 then val else null end) as msrmnt28
,max(case when msrmnt = 29 then val else null end) as msrmnt29
,max(case when msrmnt = 30 then val else null end) as msrmnt30
,max(case when msrmnt = 31 then val else null end) as msrmnt31
,max(case when msrmnt = 32 then val else null end) as msrmnt32
,max(case when msrmnt = 33 then val else null end) as msrmnt33
,max(case when msrmnt = 34 then val else null end) as msrmnt34
,max(case when msrmnt = 35 then val else null end) as msrmnt35
,max(case when msrmnt = 36 then val else null end) as msrmnt36
,max(case when msrmnt = 37 then val else null end) as msrmnt37
,max(case when msrmnt = 38 then val else null end) as msrmnt38
,max(case when msrmnt = 39 then val else null end) as msrmnt39
,max(case when msrmnt = 40 then val else null end) as msrmnt40
,max(case when msrmnt = 41 then val else null end) as msrmnt41
,max(case when msrmnt = 42 then val else null end) as msrmnt42
,max(case when msrmnt = 43 then val else null end) as msrmnt43
,max(case when msrmnt = 44 then val else null end) as msrmnt44
,max(case when msrmnt = 45 then val else null end) as msrmnt45
,max(case when msrmnt = 46 then val else null end) as msrmnt46
,max(case when msrmnt = 47 then val else null end) as msrmnt47
,max(case when msrmnt = 48 then val else null end) as msrmnt48
,max(case when msrmnt = 49 then val else null end) as msrmnt49
,max(case when msrmnt = 50 then val else null end) as msrmnt50
from sample
group by localt, entity
limit 1000

第三次尝试获得此结果:
QUERY PLAN
Limit  (cost=2257339.69..2270224.77 rows=1000 width=24) (actual time=19795.984..20090.626 rows=1000 loops=1)
  ->  GroupAggregate  (cost=2257339.69..5968242.35 rows=288000 width=24) (actual time=19795.983..20090.496 rows=1000 loops=1)
        ->  Sort  (cost=2257339.69..2293339.91 rows=14400088 width=24) (actual time=19795.626..19808.820 rows=50001 loops=1)
              Sort Key: localt
              Sort Method: external merge  Disk: 478568kB
              ->  Seq Scan on sample  (cost=0.00..249883.88 rows=14400088 width=24) (actual time=0.013..2245.247 rows=14400000 loops=1)
Total runtime: 20197.565 ms

因此,对于我的情况来说,到目前为止,交叉表似乎不是一个解决方案。而这只是我将有多年的其中一天。实际上,尽管测量哪些实体是可变的并且会引入新的实体,但我可能仍然需要使用宽格式(非规范化)表格,但我不会在这里详细介绍。

以下是我在使用Postgres 9.2.3时的一些设置:

name                    setting
max_connections             100
shared_buffers          2097152
effective_cache_size    6291456
maintenance_work_mem    1048576
work_mem                 262144

刚刚偶然发现了这个。当时从没注意过。我已经给我的答案添加了一个解决方案。虽然现在太晚了,但我一直在链接到这个问题,我不想让它无人回答。 - Erwin Brandstetter

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接