BigQuery:在重复记录中选择最小差异

3

考虑在BigQuery中使用的表结构:

Table User
{
user_id: STRING (REQUIRED)
user_name: STRING (REQUIRED)
actions: RECORD (REPEATED) 
    {
        action_id: STRING (REQUIRED)
        action_type: INTEGER (REQUIRED)
        action_date: TIMESTAMP (REQUIRED)
    }
}

我希望找到所有创建过某种类型动作超过一次的用户(user_id和user_name),并且这些动作之间的最短时间小于X天。

每个用户存储的动作数量没有定义(可以是1、2或n)。这些动作没有按任何标准排序(但我认为使用ORDER BY可以解决这个问题)。

例如,对于以下用户:

{
    user_id: "u1", 
    user_name: "User 1", 
    actions: 
    {action_id: "a1", action_type: 1, action_date: "2016-02-22"},
    {action_id: "a2", action_type: 1, action_date: "2016-01-22"},
    {action_id: "a3", action_type: 1, action_date: "2015-12-22"}
},
{
    user_id: "u2", 
    user_name: "User 2", 
    actions: 
    {action_id: "a4", action_type: 1, action_date: "2016-02-22"},
    {action_id: "a5", action_type: 2, action_date: "2016-01-22"},
    {action_id: "a6", action_type: 1, action_date: "2015-12-22"}
},
{
    user_id: "u3", 
    user_name: "User 3", 
    actions: 
    {action_id: "a7", action_type: 1, action_date: "2016-02-22"}
},
{
    user_id: "u4", 
    user_name: "User 4", 
    actions: 
    {action_id: "a8", action_type: 1, action_date: "2016-02-22"},
    {action_id: "a9", action_type: 1, action_date: "2015-02-22"},
    {action_id: "a10", action_type: 1, action_date: "2015-01-22"}
},

如何在BigQuery上执行查询:“选择执行了类型为1动作超过一次,并且每次执行之间最小时间少于45天的用户”,应该返回User 1User 4

有什么想法可以实现吗?


@MikhailBerlyant 我还没有标记为已接受,因为我还没有时间测试,请耐心等待。 - Gilberto Torrezan
1个回答

2

请尝试以下内容
这段代码是用Go语言编写的,虽然没有经过测试,但我认为它应该能够正常工作并满足您的需求。

SELECT 
  user_id, 
  user_name, 
  action_type, 
  MIN(DATEDIFF(action_date_next, action_date)) AS min_distance
FROM (
  SELECT 
    user_id, 
    user_name, 
    action_type, 
    action_date, 
    LAG(action_date) 
        OVER(PARTITION BY user_id, action_type 
        ORDER BY action_date DESC) AS action_date_next
  FROM (
    SELECT 
      user_id, 
      user_name, 
      actions.action_type AS action_type, 
      actions.action_date AS action_date 
    FROM table_users 
  )
)
WHERE action_date_next IS NOT NULL
GROUP BY user_id, user_name, action_type
HAVING action_type = 1 AND min_distance < 45

以下版本更加紧凑 - 也可以尝试使用。
SELECT 
  user_id, 
  user_name, 
  action_type, 
  MIN(DATEDIFF(action_date_next, action_date)) AS min_distance
FROM (
  SELECT 
    user_id, 
    user_name, 
    actions.action_type AS action_type, 
    actions.action_date AS action_date, 
    LAG(actions.action_date) 
        OVER(PARTITION BY user_id, actions.action_type 
        ORDER BY actions.action_date DESC) AS action_date_next
  FROM table_users
)
WHERE action_date_next IS NOT NULL
GROUP BY user_id, user_name, action_type
HAVING action_type = 1 AND min_distance < 45

BigQuery的LAG函数和TIMESTAMP存在一个bug,请查看此问题。除此之外,答案似乎解决了这个问题,我会在测试完毕后接受它。 - Gilberto Torrezan

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接