使用数据填充表格中缺失的日期（postgresql，redshift）

Question

使用数据填充表格中缺失的日期（postgresql，redshift）

postgresqlamazon-web-servicesamazon-redshiftgaps-in-data

6

我正在尝试填补每日数据中缺失的日期，但找不到答案，请帮忙。

我的daily_table示例：

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |

期望结果：我希望能够为每个域和每天填充此表格的数据，这些数据只需从前一个日期复制即可：

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |

我可以把一部分逻辑移到PHP中，但这并不理想，因为我的表中有数十亿个缺失的日期。

摘要：

在过去几天里，我发现：

Amazon Redshift使用PostgreSql的第8个版本，因此不支持如JOIN LATERAL这样漂亮的命令
Redshift也不支持generate_series和CTEs
但它支持简单的WITH（感谢@systemjack），但不支持WITH RECURSIVE

- D.Dimitrioglo

一个显而易见的问题：为什么？不把空隙留着让网页或其他内容自己决定如何显示，这不更合理吗？ - Tom Lord

这是一个要求，因为我们的客户直接使用表格，而不是通过某个接口。 - D.Dimitrioglo

Redshift支持（递归）CTE吗？ - wildplasser

我不了解CTE，我会阅读一些文档后再回答。 - D.Dimitrioglo

1

正如我所发现的那样，CTE也不被支持... - D.Dimitrioglo

这个答案使用窗口函数生成日期行：https://dev59.com/6mEh5IYBdhLWcg3wIgfN#34167753 - systemjack

4个回答

3

请看查询背后的思想：

select distinct on (domain, new_date) *
from (
    select new_date::date 
    from generate_series('2016-04-12', '2016-04-17', '1d'::interval) new_date
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

  new_date  |     domain      |    date    | visitors | hits  
------------+-----------------+------------+----------+-------
 2016-04-12 | www.domain1.com | 2016-04-12 |     1231 | 23423
 2016-04-13 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-14 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-15 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-16 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-17 | www.domain1.com | 2016-04-17 |     1262 | 21493
(6 rows)

您需要根据需求选择起始日期和结束日期。
由于查询可能会相当昂贵（您提到了数十亿的间隙），因此请谨慎应用它（在较小的数据子集上进行测试或分阶段执行）。

如果没有generate_series()，您可以创建自己的生成器。这里有一个有趣的示例。引用文章中的视图可以替代generate_series()。例如，如果您需要期间'2016-04-12' + 5 天：

select distinct on (domain, new_date) *
from (
    select '2016-04-12'::date+ n new_date
    from generator_16
    where n < 6
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

您将会得到和第一个例子一样的结果。

- klin

这非常有趣，谢谢分享，我会尝试并给出答案。 - D.Dimitrioglo

2

一个替代方案，避免使用所有“现代”功能 ;-]

-- \i tmp.sql

        -- NOTE: date and domain are keywords in SQL
CREATE TABLE ztable
        ( zdomain      TEXT NOT NULL
        , zdate       DATE NOT NULL
        , visitors      INTEGER NOT NULL DEFAULT 0
        , hits          INTEGER NOT NULL DEFAULT 0
        , PRIMARY KEY (zdomain,zdate)
        );
INSERT INTO ztable (zdomain,zdate,visitors,hits) VALUES
  ('www.domain1.com', '2016-04-12' ,1231 ,23423 )
 ,('www.domain1.com', '2016-04-13' ,1374 ,26482 )
 ,('www.domain1.com', '2016-04-17' ,1262 ,21493 )
 ,('www.domain3.com', '2016-04-14' ,3245 ,53471 )       -- << cheating!
 ,('www.domain3.com', '2016-04-15' ,2435 ,34571 )
 ,('www.domain3.com', '2016-04-16' ,2354 ,35741 )
 ,('www.domain2.com', '2016-05-09' ,2345 ,35471 ) ;

        -- Create "Calendar" table with all possible dates
        -- from the existing data in ztable.
        -- [if there are sufficient different domains
        -- in ztable there will be no gaps]
        -- [Normally the table would be filled by generate_series()
        -- or even a recursive CTE]
        -- An exta advantage is that a table can be indexed.
CREATE TABLE date_domain AS
SELECT DISTINCT zdate AS zdate
FROM ztable;
ALTER TABLE date_domain ADD PRIMARY KEY (zdate);
-- SELECT * FROM date_domain;

        -- Finding the closest previous record
        -- without using window functions or aggregate queries.
SELECT d.zdate, t.zdate, t.zdomain
        ,t.visitors, t.hits
        , (d.zdate <> t.zdate) AS is_fake -- for fun
FROM date_domain d
LEFT JOIN ztable t
        ON t.zdate <= d.zdate
        AND NOT EXISTS ( SELECT * FROM ztable nx
                WHERE nx.zdomain = t.zdomain
                AND nx.zdate > d.zdate
                AND nx.zdate < t.zdate
                )
ORDER BY t.zdomain, d.zdate
        ;

- wildplasser

1

这是一个丑陋的方法，用于让redshift生成新行，以日期为例。此示例将输出限制为前30天。范围可以进行微调或删除。同样的方法也可以用于分钟、秒等。

with days as (
    select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
            from stv_blocklist limit 30
)
select day from days order by day

为了针对特定时间范围，请将sysdate更改为一个字面量，该字面量应为所需范围结束后的最后一天，并设置覆盖的天数限制。

插入语句应类似于以下内容：

with days as (
    select (dateadd(day, -row_number() over (order by true), sysdate::date+'1 day'::interval)) as day
            from stv_blocklist limit 30
)
insert into your_table (domain, date) (
    select dns.domain, d.day
    from days d
    cross join (select distinct(domain) from your_table) dns
    left join your_table y on y.domain=dns.domain and y.date=d.day
    where y.date is null
)

我无法测试插入，可能需要进行一些微调。

对于 stv_blocklist 表的引用可以是任何具有足够行数以覆盖 with 子句中范围限制的表，并且用于提供 row_number() 窗口函数的种子。

一旦您将仅包含日期的行放置到位，就可以使用最新的完整记录来更新它们，如下所示：

update your_table set visitors=t.visitors, hits=t.hits
from (
    select a.domain, a.date, b.visitors, b.hits
    from your_table a
    inner join your_table b
        on b.domain=a.domain and b.date=(SELECT max(date) FROM your_table where domain=a.domain and hits is not null and date < a.date)
    where a.hits is null
) t
where your_table.domain=t.domain and your_table.date=t.date

这可能比较慢，但对于较小的数据集或一次性使用应该没问题。我能测试一个类似的查询。

更新：我认为这个填充空值的查询版本应该更好，并考虑域和日期。我测试了一个类似的版本。

update your_table set visitors=t.prev_visitors, hits=t.prev_hits
from (
    select domain, date, hits
        lag(visitors,1) ignore nulls over (partition by domain order by date) as prev_visitors,
        lag(hits,1) ignore nulls over (partition by domain order by date) as prev_hits
    from your_table
) t
where t.hits is null and your_table.domain=t.domain and your_table.date=t.date

可以将这个与初始人口查询结合起来一次性完成。

- systemjack

我会尽快尝试并告诉您。 - D.Dimitrioglo

@wildplasser，Redshift 很好地支持窗口函数。我经常使用它们。http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html - systemjack

我很惊讶，Redshift支持WITH xx AS，但不支持WITH RECURSIVE xx AS。现在我正在尝试使用它，看起来它会很有用。 - D.Dimitrioglo

很不幸，我每个日期都有多个域，而每个域的最小/最大日期都不同。 - D.Dimitrioglo

我进行了未经测试的更新，应该能够填充每个域中指定范围内的所有日期。当我准备好时，我还有另一个想法要发布。 - systemjack

显示剩余2条评论

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- D.Dimitrioglo · Accepted Answer

最终，我完成了我的任务，想分享一些有用的东西。

我使用这个钩子而不是generate_series：

WITH date_range AS (
  SELECT trunc(current_date - (row_number() OVER ())) AS date
  FROM any_table  -- any of your table which has enough data
  LIMIT 365
) SELECT * FROM date_range;

为了获取我需要填充数据的URL列表，我使用了以下代码：

WITH url_list AS (
  SELECT
    url AS gapsed_url,
    MIN(timestamp_gmt) AS min_date,
    MAX(timestamp_gmt) AS max_date
  FROM daily_table
  WHERE url IN (
    SELECT url FROM daily_table GROUP BY url
    HAVING count(url) < (MAX(timestamp_gmt) - MIN(timestamp_gmt) + 1)
  )
  GROUP BY url
) SELECT * FROM url_list;

然后我将给定的数据合并起来，我们称之为 url_mapping:

SELECT t1.*, t2.gapsed_url FROM date_range AS t1 CROSS JOIN url_list AS t2
WHERE t1.date <= t2.max_date AND t1.date >= t2.min_date;

为了根据最接近的日期获取数据，我执行了以下操作:

SELECT sd.*
FROM url_mapping AS um JOIN daily_table AS sd
ON um.gapsed_url = sd.url AND (
  sd.timestamp_gmt = (SELECT max(timestamp_gmt) FROM daily_table WHERE url = sd.url AND timestamp_gmt <= um.date)
)

我希望这能对某些人有所帮助。