使用SQL计算最长的连续观看电视剧时间

3

我正在尝试分析一些连续观看的统计数据,我想知道最长连续观看记录是多少(连续观看是指在两个小时之内,一个接一个地连续观看多个节目)。数据如下:

datetime                user_id program
2013-09-01 00:01:18     1       A
2013-09-10 14:03:14     1       B
2013-09-20 17:02:12     2       A  
2013-09-21 00:03:22     2       C  <-- user 2 binge start
2013-09-21 01:23:22     2       M
2013-09-21 03:03:22     2       E
2013-09-21 04:03:22     2       F  
2013-09-21 06:03:22     2       G  <-- user 2 binge end
2013-09-21 09:03:22     2       H
2013-09-03 18:21:09     3       D
2013-09-21 09:03:22     2       H
2013-09-24 19:21:00     2       X  <-- user 2 second binge start
2013-09-24 20:21:00     2       Y
2013-09-24 21:21:00     2       Z  <-- user 2 second binge end

SQL Fiddle

在这个例子中,用户2有一个持续6小时的狂欢,后来又有一个持续2小时的狂欢。

我想要的最终结果是:

user_id     binge     length
2           1         6 hours
2           2         2 hours

这个能在数据库中直接计算吗?


1
我不知道什么是“bing”,也不知道如何衡量它是否足够“长”。如果您能在问题中正确描述它,并且还能添加一个更完整的示例,那么我将能够研究可能的 SQL 解决方案来解决您的情况。 - vyegorov
1
“或者”是什么意思?你的意思是想要连续的行数,其中下一行的时间是前一个时间加上1.5-2.5小时。哦,还有这是Postgres,你可以在其中做任何事情,甚至很多你不应该做的事情。 - Jakub Kania
@JakubKania 没错,如果下一个视图的时间比上一个视图的时间少于2小时,我会尝试使用这个限制。 - jenswirf
2013-09-21 09:03:22 2 H 这里出现了两次。我在我的示例中将第二个放在了2013-09-22,但这不应该有任何影响。 - Bruno
2个回答

3
这是一个关于在数据中识别序列/连续性的问题。我喜欢的解决方法是:
  • 使用LAG函数来识别每个连续性的开始
  • 使用SUM函数为每个连续性分配唯一的编号
  • 然后按照这个唯一编号进行分组以进行进一步处理

查询:

with start_grp as (
  select dt, user_id, programme,
         case when dt - lag(dt,1) over (partition by user_id order by dt) 
                   > interval '0 day 2:00:00'
              then 1
              else 0
         end grp_start
  from binge
  ),
assign_grp as (
  select dt, user_id, programme,
  sum(grp_start) over (partition by user_id order by dt) grp
  from start_grp)
select user_id, grp as binge, max(dt) - min(dt) as binge_length
from assign_grp
group by user_id, grp
having count(programme) > 1

这里的 binge 列可能不是按照顺序排列的。您可以在最终查询上使用 ROW_NUMBER 函数来进行纠正。
演示请参见 sqlfiddle

使用 grp_start 肯定比我做的递归 CTE 更简单。 - Bruno

1
这里提供一种使用递归CTE(其实不是真正的“递归”,但它们就是这样被称呼)和窗口函数的解决方案。您需要至少安装PostgreSQL 8.4才能运行此方法。

SQL Fiddle

PostgreSQL 9.1.9模式设置:
CREATE TABLE viewings (
    user_id INTEGER NOT NULL,
    datetime TIMESTAMPTZ NOT NULL,
    programme TEXT NOT NULL,
    PRIMARY KEY (user_id, datetime)
);

INSERT INTO viewings (datetime, user_id, programme) VALUES
('2013-09-01 00:01:18', 1, 'A'),
('2013-09-10 14:03:14', 1, 'B'),
('2013-09-20 17:02:12', 2, 'A'),
('2013-09-21 00:03:22', 2, 'C'),
('2013-09-21 01:23:22', 2, 'M'),
('2013-09-21 03:03:22', 2, 'E'),
('2013-09-21 04:03:22', 2, 'F'),
('2013-09-21 06:03:22', 2, 'G'),
('2013-09-21 09:03:22', 2, 'H'),
('2013-09-03 18:21:09', 3, 'D'),
('2013-09-22 09:03:22', 2, 'H'),
('2013-09-24 19:21:00', 2, 'X'),
('2013-09-24 20:21:00', 2, 'Y'),
('2013-09-24 21:21:00', 2, 'Z');

查询 1:

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
  WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
  )
  SELECT user_id, datetime AS first_dt, datetime AS last_dt
    FROM lagged_viewings
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
  UNION ALL
  SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
    FROM consecutive_viewings cv
      INNER JOIN lagged_viewings lv
      ON lv.user_id=cv.user_id AND
         lv.prev_datetime=cv.last_dt
      WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT user_id, first_dt, MAX(last_dt) AS last_dt
   FROM consecutive_viewings
   WHERE first_dt != last_dt
   GROUP BY user_id, first_dt
   ORDER BY user_id, first_dt
结果:
| USER_ID |                         FIRST_DT |                          LAST_DT |
|---------|----------------------------------|----------------------------------|
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |

为了理解这个问题,也许从最嵌套的CTE开始更容易。它将按user_iddatetime对视图进行排序,但同时还会添加一个额外的列,其中包含前一个视图的时间戳,以便稍后可以链接它们。这不是递归CTE(即使在单独的以下查询中也不需要CTE): 查询2:
WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
)
SELECT * FROM lagged_viewings

结果:

| USER_ID | PREV_USER_ID |                         DATETIME |                    PREV_DATETIME | PROGRAMME |
|---------|--------------|----------------------------------|----------------------------------|-----------|
|       1 |       (null) | September, 01 2013 00:01:18+0000 |                           (null) |         A |
|       1 |            1 | September, 10 2013 14:03:14+0000 | September, 01 2013 00:01:18+0000 |         B |
|       2 |       (null) | September, 20 2013 17:02:12+0000 |                           (null) |         A |
|       2 |            2 | September, 21 2013 00:03:22+0000 | September, 20 2013 17:02:12+0000 |         C |
|       2 |            2 | September, 21 2013 01:23:22+0000 | September, 21 2013 00:03:22+0000 |         M |
|       2 |            2 | September, 21 2013 03:03:22+0000 | September, 21 2013 01:23:22+0000 |         E |
|       2 |            2 | September, 21 2013 04:03:22+0000 | September, 21 2013 03:03:22+0000 |         F |
|       2 |            2 | September, 21 2013 06:03:22+0000 | September, 21 2013 04:03:22+0000 |         G |
|       2 |            2 | September, 21 2013 09:03:22+0000 | September, 21 2013 06:03:22+0000 |         H |
|       2 |            2 | September, 22 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |         H |
|       2 |            2 | September, 24 2013 19:21:00+0000 | September, 22 2013 09:03:22+0000 |         X |
|       2 |            2 | September, 24 2013 20:21:00+0000 | September, 24 2013 19:21:00+0000 |         Y |
|       2 |            2 | September, 24 2013 21:21:00+0000 | September, 24 2013 20:21:00+0000 |         Z |
|       3 |       (null) | September, 03 2013 18:21:09+0000 |                           (null) |         D |

这个递归CTE可能有点棘手。 "递归" 基于两个select语句之间的联合。
  • 第一个语句是种子迭代(非递归部分):它将找到开始观看链的行(即,如果它是该用户的第一次观看,则前一个日期时间为null,或者前一个日期时间比您的截止间隔更远)。
  • 第二个语句链接该观看以形成更长的时间段。一些持续时间会重叠,因为它不知道何时结束。这就是在顶部的整体查询中使用条件的原因,以查找最大值并消除仅具有单个观看的时间段。

查询3

WITH RECURSIVE consecutive_viewings(user_id, first_dt, last_dt) AS (
  WITH lagged_viewings AS (
    SELECT user_id, LAG(user_id) OVER w AS prev_user_id,
           datetime, LAG(datetime) OVER w AS prev_datetime,
           programme
    FROM viewings WINDOW w AS (PARTITION BY user_id ORDER BY datetime)
  )
  -- These are the starts of the "binge" durations
  SELECT user_id, datetime AS first_dt, datetime AS last_dt
    FROM lagged_viewings
    WHERE prev_datetime IS NULL OR (prev_datetime + '2 hours'::interval) < datetime
  UNION ALL
  -- These are the extended periods
  SELECT lv.user_id, cv.first_dt, lv.datetime AS last_dt
    FROM consecutive_viewings cv
      INNER JOIN lagged_viewings lv
      ON lv.user_id=cv.user_id AND
         lv.prev_datetime=cv.last_dt
      WHERE (lv.prev_datetime + '2 hours'::interval) >= lv.datetime
)
SELECT * FROM consecutive_viewings
   ORDER BY user_id, first_dt, last_dt

结果:

| USER_ID |                         FIRST_DT |                          LAST_DT |
|---------|----------------------------------|----------------------------------|
|       1 | September, 01 2013 00:01:18+0000 | September, 01 2013 00:01:18+0000 |
|       1 | September, 10 2013 14:03:14+0000 | September, 10 2013 14:03:14+0000 |
|       2 | September, 20 2013 17:02:12+0000 | September, 20 2013 17:02:12+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 00:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 01:23:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 03:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 04:03:22+0000 |
|       2 | September, 21 2013 00:03:22+0000 | September, 21 2013 06:03:22+0000 |
|       2 | September, 21 2013 09:03:22+0000 | September, 21 2013 09:03:22+0000 |
|       2 | September, 22 2013 09:03:22+0000 | September, 22 2013 09:03:22+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 19:21:00+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 20:21:00+0000 |
|       2 | September, 24 2013 19:21:00+0000 | September, 24 2013 21:21:00+0000 |
|       3 | September, 03 2013 18:21:09+0000 | September, 03 2013 18:21:09+0000 |

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接