SQL / Oracle 聚合区间日期桶

3
我有一个关于SQL的问题,希望能得到一些帮助,因为一直找不到合适的答案。
背景:
我正在使用供应商产品,它具有作为后端的Oracle数据库。我可以编写任何adhoc SQL来查询底层表,但不能对其基础结构(或数据模型本身)进行任何更改。我目前感兴趣的表大约有100万行,主要用于跟踪用户会话。它有4个感兴趣的列:session_id(这是一个主键,每个会话唯一),user_namestart_date(跟踪会话开始的日期)和stop_date(跟踪会话结束的日期)。我的目标是根据给定的开始日期和结束日期,在月、日和小时级别上执行活动会话的数据聚合。我需要创建一个视图(或3个单独的视图),它可以执行聚合本身,或者作为中间对象,然后我可以查询并执行聚合操作。我理解最终的SQL/视图可能实际上需要3个不同的视图(一个用于月份,一个用于天数,一个用于小时数),但似乎无论时间段如何,概念都应该是相同的。
当前表示例:
表名 = web_session
| Session_id | user_name | start_date            | stop_date    
----------------------------------------------------------------------------
|      1     |    joe    | 4/20/2017 10:42:10 PM | 4/21/2017 2:42:10 AM  |
|      2     |   matt    | 4/20/2017 5:43:10 PM  | 4/20/2017 5:59:10 PM  |
|      3     |   matt    | 4/20/2017 3:42:10 PM  | 4/20/2017 5:42:10 PM  |
|      4     |    joe    | 4/20/2017 11:20:10 AM | 4/20/2017 4:42:10 PM  |
|      5     |   john    | 4/20/2017 8:42:10 AM  | 4/20/2017 11:42:10 AM |
|      6     |   matt    | 4/20/2017 7:42:10 AM  | 4/20/2017 11:42:10 PM | 
|      7     |    joe    | 4/19/2017 11:20:10 PM | 4/20/2017 1:42:10 AM  |

小时视图的理想输出

- 对于示例,12:00可以是0或24。

| Date           | HR   | active_sessions | distinct_users |
------------------------------------------------------------
| 4/21/2017      | 2    | 1               | 1              |
| 4/21/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/20/2017      | 23   | 1               | 1              |
| 4/20/2017      | 22   | 1               | 1              |
| 4/20/2017      | 17   | 2               | 1              |
| 4/20/2017      | 16   | 2               | 2              |
| 4/20/2017      | 15   | 2               | 2              |
| 4/20/2017      | 14   | 1               | 1              |
| 4/20/2017      | 13   | 1               | 1              |
| 4/20/2017      | 12   | 1               | 1              |
| 4/20/2017      | 11   | 3               | 3              |
| 4/20/2017      | 10   | 2               | 2              |
| 4/20/2017      | 9    | 2               | 2              |
| 4/20/2017      | 8    | 2               | 2              |
| 4/20/2017      | 7    | 1               | 1              |
| 4/20/2017      | 1    | 1               | 1              |
| 4/20/2017      | 0    | 1               | 1              |
| 4/19/2017      | 23   | 1               | 1              |

目标及其他选项

我想通过这个输出最终实现的目标是,在两个日期之间,以月、日或小时为单位显示活动会话数的折线图。在小时示例中,结合日期和小时使用X轴,活动会话使用Y轴。如果用户悬停在图表上的点上,则可以查看不同的用户数。请注意,活动会话是在时间间隔内任何时间打开的会话总数。不同的用户是时间间隔内的不同用户总数。如果我在同一小时内登录和注销了两次,则会有2个活动会话,但只有1个不同的用户。

备选方案

这似乎是一个经常出现的问题,但从我所有的谷歌搜索和栈溢出研究中,我似乎找不到正确的方法。如果我的查询或理想输出思路有误,我也接受替代建议,以使我能够在前端适当地填充所需的输出以生成图表。

我尝试过的一些SQL(真诚的努力)

我尝试过很多查询,但我将从这个查询开始,因为它是我最接近的,但速度非常慢(无法使用),而且它仍然没有产生我需要的结果。

Select * FROM (

  SELECT 
  u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt,
  COUNT(Distinct u.session_id) as unique_sessions,
  COUNT(Distinct u.user_name) as unique_users,
  LISTAGG(u.user_name, ', ') WITHIN GROUP (ORDER BY u.user_name ASC) as users
  FROM
  (SELECT  EXTRACT(year FROM l.start_date) as YearDt,
              EXTRACT(month FROM l.start_date) as MonthDt,
              EXTRACT(day FROM l.start_date) as DayDt,
              EXTRACT(HOUR FROM CAST(l.start_date AS TIMESTAMP)) as HourDt,
              EXTRACT(MINUTE FROM CAST(l.start_date AS TIMESTAMP)) as MinDt,
              l.session_id,
              l.user_name,
              l.start_date as act_date,
              1 as is_start
  FROM web_session l
  UNION ALL
  SELECT  EXTRACT(year FROM l.stop_date) as YearDt,
              EXTRACT(month FROM l.stop_date) as MonthDt,
              EXTRACT(day FROM l.stop_date) as DayDt,
              EXTRACT(HOUR FROM CAST(l.stop_date AS TIMESTAMP)) as HourDt,
              EXTRACT(MINUTE FROM CAST(l.stop_date AS TIMESTAMP)) as MinDt,
              l.session_id,
              l.user_name,
              l.stop_date as act_date,
              0 as is_start
  FROM web_session l
  ) u
  GROUP BY CUBE ( u.YearDt, u.MonthDt, u.DayDt, u.HourDt, u.MinDt)
) c

@Shawn 说实话,那很可能是之前尝试让查询工作的残留片段。我想我曾经试图从这里 https://discourse.looker.com/t/sql-pattern-summarizing-entities-with-a-start-end-date-over-time/4868 适应它。 - Matt
@DavidFaber 如果有这个选项的话会很好,因为它会使最终的图表更加一致,但这并非必需。我希望同样的模式也可以应用于天数和月份。 - Matt
活动会话和不同用户之间有什么区别?几乎都是相同的,只有这一行不同:| 4/20/2017 | 17 | 2 | 1 |,你能解释一下它们为什么不同吗? - krokodilko
@krokodilko 当然可以。在示例数据中,我只提供了一个例子,但这很重要。活跃会话是指在任何时间点打开的会话总数。不同用户是指在该时间段内的不同用户总数。如果我在同一小时内登录和注销两次,则会有2个活动会话,但只有1个不同的用户。 - Matt
由于您无法向数据库添加内容,我建议使用一个按小时分解的日历表,并将其连接到您的查询中,而不是尝试在每行数据上提取所有日期部分。 - Shawn
显示剩余4条评论
3个回答

2
您可以使用CTE(查询1)或相关分层查询(查询2)生成时间范围内的小时数,然后进行聚合。这只需要进行一次表扫描:
SQL Fiddle Oracle 11g R2模式设置:
CREATE TABLE Web_Session ( Session_id, user_name, start_date, stop_date ) AS
SELECT 1, 'joe',  CAST( TIMESTAMP '2017-04-20 22:42:10' AS DATE ), CAST( TIMESTAMP '2017-04-21 02:42:10' AS DATE ) FROM DUAL UNION ALL
SELECT 2, 'matt', TIMESTAMP '2017-04-20 17:43:10', TIMESTAMP '2017-04-20 17:59:10' FROM DUAL UNION ALL
SELECT 3, 'matt', TIMESTAMP '2017-04-20 15:42:10', TIMESTAMP '2017-04-20 17:42:10' FROM DUAL UNION ALL
SELECT 4, 'joe',  TIMESTAMP '2017-04-20 11:20:10', TIMESTAMP '2017-04-20 16:42:10' FROM DUAL UNION ALL
SELECT 5, 'john', TIMESTAMP '2017-04-20 08:42:10', TIMESTAMP '2017-04-20 11:42:10' FROM DUAL UNION ALL
SELECT 6, 'matt', TIMESTAMP '2017-04-20 07:42:10', TIMESTAMP '2017-04-20 23:42:10' FROM DUAL UNION ALL
SELECT 7, 'joe',  TIMESTAMP '2017-04-19 23:20:10', TIMESTAMP '2017-04-20 01:42:10' FROM DUAL;

查询1:
WITH hours ( session_id, user_name, hour, duration ) AS (
  SELECT session_id,
         user_name,
         CAST( TRUNC( start_date, 'HH24' ) AS DATE ),
         ( TRUNC( stop_date, 'HH24' ) - TRUNC( start_date, 'HH24' ) ) * 24
  FROM   web_session
UNION ALL
  SELECT session_id,
         user_name,
         hour + INTERVAL '1' HOUR, -- There is a bug in SQLFiddle that subtracts
                                   -- hours instead of adding so -1 is used there.
         duration - 1
  FROM   hours
  WHERE  duration > 0
)
SELECT hour,
       COUNT( session_id ) AS active_sessions,
       COUNT( DISTINCT user_name ) AS distinct_users
FROM   hours
GROUP BY hour
ORDER BY hour

结果:

结果

|                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z |               1 |              1 |
| 2017-04-20T00:00:00Z |               1 |              1 |
| 2017-04-20T01:00:00Z |               1 |              1 |
| 2017-04-20T07:00:00Z |               1 |              1 |
| 2017-04-20T08:00:00Z |               2 |              2 |
| 2017-04-20T09:00:00Z |               2 |              2 |
| 2017-04-20T10:00:00Z |               2 |              2 |
| 2017-04-20T11:00:00Z |               3 |              3 |
| 2017-04-20T12:00:00Z |               2 |              2 |
| 2017-04-20T13:00:00Z |               2 |              2 |
| 2017-04-20T14:00:00Z |               2 |              2 |
| 2017-04-20T15:00:00Z |               3 |              2 |
| 2017-04-20T16:00:00Z |               3 |              2 |
| 2017-04-20T17:00:00Z |               3 |              1 |
| 2017-04-20T18:00:00Z |               1 |              1 |
| 2017-04-20T19:00:00Z |               1 |              1 |
| 2017-04-20T20:00:00Z |               1 |              1 |
| 2017-04-20T21:00:00Z |               1 |              1 |
| 2017-04-20T22:00:00Z |               2 |              2 |
| 2017-04-20T23:00:00Z |               2 |              2 |
| 2017-04-21T00:00:00Z |               1 |              1 |
| 2017-04-21T01:00:00Z |               1 |              1 |
| 2017-04-21T02:00:00Z |               1 |              1 |

执行计划:
-------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name        | Rows | Bytes | Cost | Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |             |   14 |   364 |    7 | 00:00:01 |
|   1 |   SORT GROUP BY                                |             |   14 |   364 |    7 | 00:00:01 |
|   2 |    VIEW                                        | VW_DAG_0    |   14 |   364 |    7 | 00:00:01 |
|   3 |     HASH GROUP BY                              |             |   14 |   364 |    7 | 00:00:01 |
|   4 |      VIEW                                      |             |   14 |   364 |    6 | 00:00:01 |
|   5 |       UNION ALL (RECURSIVE WITH) BREADTH FIRST |             |      |       |      |          |
|   6 |        TABLE ACCESS FULL                       | WEB_SESSION |    7 |   245 |    3 | 00:00:01 |
| * 7 |        RECURSIVE WITH PUMP                     |             |      |       |      |          |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 7 - filter("DURATION">0)

Note
-----
- dynamic sampling used for this statement

查询2:
SELECT t.COLUMN_VALUE AS hour,
       COUNT( session_id ) AS active_sessions,
       COUNT( DISTINCT user_name ) AS distinct_users
FROM   web_session w
       CROSS JOIN
       TABLE(
         CAST(
           MULTISET(
             SELECT TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24
             FROM   DUAL
             CONNECT BY TRUNC( w.start_date, 'HH24' ) + ( LEVEL - 1 ) / 24 < w.stop_date
           ) AS SYS.ODCIDATELIST
         )
       ) t
GROUP BY t.COLUMN_VALUE
ORDER BY hour
结果

|                 HOUR | ACTIVE_SESSIONS | DISTINCT_USERS |
|----------------------|-----------------|----------------|
| 2017-04-19T23:00:00Z |               1 |              1 |
| 2017-04-20T00:00:00Z |               1 |              1 |
| 2017-04-20T01:00:00Z |               1 |              1 |
| 2017-04-20T07:00:00Z |               1 |              1 |
| 2017-04-20T08:00:00Z |               2 |              2 |
| 2017-04-20T09:00:00Z |               2 |              2 |
| 2017-04-20T10:00:00Z |               2 |              2 |
| 2017-04-20T11:00:00Z |               3 |              3 |
| 2017-04-20T12:00:00Z |               2 |              2 |
| 2017-04-20T13:00:00Z |               2 |              2 |
| 2017-04-20T14:00:00Z |               2 |              2 |
| 2017-04-20T15:00:00Z |               3 |              2 |
| 2017-04-20T16:00:00Z |               3 |              2 |
| 2017-04-20T17:00:00Z |               3 |              1 |
| 2017-04-20T18:00:00Z |               1 |              1 |
| 2017-04-20T19:00:00Z |               1 |              1 |
| 2017-04-20T20:00:00Z |               1 |              1 |
| 2017-04-20T21:00:00Z |               1 |              1 |
| 2017-04-20T22:00:00Z |               2 |              2 |
| 2017-04-20T23:00:00Z |               2 |              2 |
| 2017-04-21T00:00:00Z |               1 |              1 |
| 2017-04-21T01:00:00Z |               1 |              1 |
| 2017-04-21T02:00:00Z |               1 |              1 |

执行计划:
--------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name        | Rows  | Bytes   | Cost | Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |             | 57176 | 2115512 |  200 | 00:00:03 |
|   1 |   SORT GROUP BY                        |             | 57176 | 2115512 |  200 | 00:00:03 |
|   2 |    NESTED LOOPS                        |             | 57176 | 2115512 |  195 | 00:00:03 |
|   3 |     TABLE ACCESS FULL                  | WEB_SESSION |     7 |     245 |    3 | 00:00:01 |
|   4 |     COLLECTION ITERATOR SUBQUERY FETCH |             |  8168 |   16336 |   27 | 00:00:01 |
| * 5 |      CONNECT BY WITHOUT FILTERING      |             |       |         |      |          |
|   6 |       FAST DUAL                        |             |     1 |         |    2 | 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 5 - filter(TRUNC(:B1,'fmhh24')+(LEVEL-1)/24<:B2)

Note
-----
- dynamic sampling used for this statement

这个很好用,我可以很容易地修改它来计算我需要按天或月分组的内容。需要注意的是,这仅包括有任何活动的时期,而不是所有时期,这就是我所说的要求。非常感谢! - Matt

1
我认为这样做会起作用:

WITH ct ( active_dt ) AS (
    -- Build the query for the "table" of hours
    SELECT DATE'2018-04-19' + (LEVEL-1)/24 AS active_dt FROM dual
   CONNECT BY DATE'2018-04-19' + (LEVEL-1)/24 < DATE'2018-04-22'
)
SELECT active_dt AS "Date", active_hr AS "HR"
     , COUNT(session_id) AS active_sessions
     , COUNT(DISTINCT user_name) AS distinct_users
  FROM (
    SELECT TRUNC(ct.active_dt) AS active_dt
         , TO_CHAR(ct.active_dt, 'HH24') AS active_hr
         , ws.session_id, ws.user_name
      FROM ct LEFT JOIN web_session ws
        ON ct.active_dt + 1/24 >= ws.start_dt
       AND ct.active_dt < ws.stop_dt
) GROUP BY active_dt, active_hr
 ORDER BY active_dt DESC, active_hr DESC;

我可能没有完全正确理解 LEFT JOIN 的使用条件。

希望这能帮到你。


这个很好用,我可以很容易地修改它来计算我需要按天或月分组的内容。我非常感激它展示了所有时间段,即使没有活动,这对我想要显示的图表也很有帮助。谢谢! - Matt
@Matt,如果这个答案对您有帮助,请接受它。 - David Faber

1

Matt,

你需要做的是生成一个时间维度,可以是静态表或者在运行时动态生成:

create table time_dim (
  ts date primary key,
  year number not null,
  month number not null,
  day number not null,
  wday number not null,
  dy varchar2(3) not null,
  hr number not null
);

insert into time_dim (ts, year, month, day, wday, dy, hr)
select ts
     , extract(year from ts) year
     , extract(month from ts) month
     , extract(day from ts) day
     , to_char(ts,'d') wday
     , to_char(ts,'dy') dy
     , to_number(to_char(ts,'HH24')) hr
  from (
select DATE '2017-01-01' + (level - 1)/24 ts
  FROM DUAL connect by level <= 365*24) a;

然后将其外连接到您的web_sessions表:
select t.ts, t.year, t.month, t.wday, t.dy, t.hr
     , count(session_id) sessions
     , count(distinct user_name) users
  from time_dim t
  left join web_session w
    on t.ts between trunc(w.start_date, 'hh24') and w.stop_date
 where trunc(t.ts) between date '2017-04-19' and date '2017-04-21'
 group by rollup (t.year, t.month, (t.wday, t.dy), (t.hr, t.ts));

你可以更改group by子句以获取你感兴趣的各种聚合数据。
在上面的代码中,我在ON子句中将start_date截断到小时,以便开始小时包含在结果中,否则不是整点开始的会话将不会在该小时内计算。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接