How to aggregate (count distinct items) over a sliding window in SQL Server?

14

I am currently using this query (in SQL Server) to count the number of unique items each day:

SELECT Date, COUNT(DISTINCT item) 
FROM myTable 
GROUP BY Date 
ORDER BY Date

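How can I transform this to get, for each date, the number of unique items over the past 3 days (including the current day)?

The output should be a table with 2 columns: one with all the dates in the original table, and a second with the count of unique items for each date.

For instance, if the original table is: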

Date        Item  
01/01/2018  A  
01/01/2018  B  
02/01/2018  C  
03/01/2018  C    
04/01/2018  C

With the query above, I currently get the unique count for each day:

Date        count  
01/01/2018  2  
02/01/2018  1  
03/01/2018  1  
04/01/2018  1

What I would like to get is the unique count over a rolling 3-day window:

Date        count  
01/01/2018  2  
02/01/2018  3  (because items ABC on 1st and 2nd Jan)
03/01/2018  3  (because items ABC on 1st,2nd,3rd Jan)    
04/01/2018  1  (because only item C on 2nd,3rd,4th Jan)    
9 Answers

8
Using an apply provides a convenient way to form sliding windows.
CREATE TABLE myTable 
    ([DateCol] datetime, [Item] varchar(1))
;

INSERT INTO myTable 
    ([DateCol], [Item])
VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C')
;

CREATE NONCLUSTERED INDEX IX_DateCol  
    ON myTable([DateCol])  
;    

Query:

select distinct 
       t1.dateCol
     , oa.ItemCount
from myTable t1
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,t1.DateCol) and t1.DateCol
  ) oa
order by t1.dateCol ASC

Results:

|              dateCol | ItemCount |
|----------------------|-----------|
| 2018-01-01T00:00:00Z |         2 |
| 2018-01-02T00:00:00Z |         3 |
| 2018-01-03T00:00:00Z |         3 |
| 2018-01-04T00:00:00Z |         1 |

There may be some performance gain from reducing the date column to its distinct values before using the apply, like so:

select 
       d.DateCol
     , oa.ItemCount
from (
    select distinct t1.DateCol
    from myTable t1
     ) d
outer apply (
      select count(distinct t2.item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,d.DateCol) and d.DateCol
  ) oa
order by d.DateCol ASC
;

In that subquery you could use group by instead of select distinct, but the execution plan will remain the same.
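For reference, the group by form of that subquery might look like the sketch below. It assumes the myTable(DateCol, Item) schema created at the top of this answer and changes nothing else in the query.

```sql
-- Sketch only: same query as above, with GROUP BY producing the distinct dates.
-- Assumes the myTable(DateCol, Item) table created earlier in this answer.
select
       d.DateCol
     , oa.ItemCount
from (
    select t1.DateCol
    from myTable t1
    group by t1.DateCol      -- replaces "select distinct t1.DateCol"
     ) d
outer apply (
      select count(distinct t2.Item) as ItemCount
      from myTable t2
      where t2.DateCol between dateadd(day,-2,d.DateCol) and d.DateCol
  ) oa
order by d.DateCol ASC
;
```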

Demo at SQL Fiddle


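Thanks. It seems very slow, though. Could we instead imagine joining 3 tables, each with a different lag, and running the usual count distinct on the joined table? - RockScience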
1
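Do you have an index on DateCol? Have you looked at the execution plan? - Paul Maxwell
Cross apply will be faster. In any case, @RockScience, apply is much faster than using LAG in any situation. You can experiment yourself and read the many articles about it. As an example, in a case similar to yours, on my production database with about 15 million rows, the apply version ran in 5 minutes while the LAG version took 3 hours. - rpd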

5
The most straightforward solution is to join the table with itself based on dates:
SELECT t1.DateCol, COUNT(DISTINCT t2.Item) AS C
FROM testdata AS t1 
LEFT JOIN testdata AS t2 ON t2.DateCol BETWEEN DATEADD(dd, -2, t1.DateCol) AND t1.DateCol
GROUP BY t1.DateCol
ORDER BY t1.DateCol

Output:

| DateCol                 | C |
|-------------------------|---|
| 2018-01-01 00:00:00.000 | 2 |
| 2018-01-02 00:00:00.000 | 3 |
| 2018-01-03 00:00:00.000 | 3 |
| 2018-01-04 00:00:00.000 | 1 |

3

GROUP BY should be faster than DISTINCT (make sure there is an index on your Date column)

DECLARE @tbl TABLE([Date] DATE, [Item] VARCHAR(100))
;

INSERT INTO @tbl  VALUES
    ('2018-01-01 00:00:00', 'A'),
    ('2018-01-01 00:00:00', 'B'),
    ('2018-01-02 00:00:00', 'C'),
    ('2018-01-03 00:00:00', 'C'),
    ('2018-01-04 00:00:00', 'C');

SELECT t.[Date]

      --Just for control. You can take this part away
      ,(SELECT DISTINCT t2.[Item] AS [*]
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date]) FOR XML PATH('')) AS CountedItems

      --This sub-select comes back with your counts 
      ,(SELECT COUNT(DISTINCT t2.[Item])
        FROM @tbl AS t2
        WHERE t2.[Date]<=t.[Date] 
          AND t2.[Date]>=DATEADD(DAY,-2,t.[Date])) AS ItemCount
FROM @tbl AS t
GROUP BY t.[Date];

Result
Date        CountedItems    ItemCount
2018-01-01  AB              2
2018-01-02  ABC             3
2018-01-03  ABC             3
2018-01-04  C               1
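The index mentioned at the top of this answer could be created roughly as follows. This is only a sketch: the index names are hypothetical, including Item in the key is an assumption made here so the count can be answered from the index alone, and the inline INDEX syntax for table variables requires SQL Server 2014 or later.

```sql
-- Permanent table, as in the question: myTable([Date], Item).
-- Hypothetical index name; Item is added to the key as an assumption.
CREATE NONCLUSTERED INDEX IX_myTable_Date_Item
    ON myTable ([Date], Item);

-- Table variable as used in this answer (inline index, SQL Server 2014+).
DECLARE @tbl TABLE(
    [Date] DATE,
    [Item] VARCHAR(100),
    INDEX IX_Date_Item NONCLUSTERED ([Date], [Item])
);
```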

2

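This solution is different from the other solutions. Can you check the performance of this query on real data in comparison to the other answers?

The basic idea is that each row can participate in the window for its own date, the day after, or the day after that. So it first expands each row into three rows with those different dates attached, and then it can use a regular COUNT(DISTINCT) aggregating on the computed date. The HAVING clause is only there to avoid returning results for dates that were solely computed and are not present in the base data.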

with cte(Date, Item) as (
    select convert(datetime, a, 103), b  -- style 103 reads the strings as dd/mm/yyyy
    from (values 
        ('01/01/2018','A')
        ,('01/01/2018','B')
        ,('02/01/2018','C')
        ,('03/01/2018','C')
        ,('04/01/2018','C')) t(a,b)
)

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    cte
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)

Output:

|        Date             | Count |
|-------------------------|-------|
| 2018-01-01 00:00:00.000 |   2   |
| 2018-01-02 00:00:00.000 |   3   |
| 2018-01-03 00:00:00.000 |   3   |
| 2018-01-04 00:00:00.000 |   1   |

If you have many duplicate rows, it might be faster to use the following:

select 
    [Date] = dateadd(dd, n, Date), [Count] = count(distinct Item)
from 
    (select distinct Date, Item from cte) c
    cross join (values (0),(1),(2)) t(n)
group by dateadd(dd, n, Date)
having max(iif(n = 0, 1, 0)) = 1

option (force order)
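Run against an existing table instead of the inline sample data, the same expansion might look like the sketch below. It assumes the question's table, myTable with [Date] and Item columns, leaves out the force order hint, and adds an ORDER BY for readability. iif requires SQL Server 2012 or later.

```sql
-- Sketch only: assumes the question's table myTable([Date], Item).
select
    [Date] = dateadd(dd, n, [Date]), [Count] = count(distinct Item)
from
    (select distinct [Date], Item from myTable) c
    cross join (values (0),(1),(2)) t(n)       -- each row lands on its own date, +1 day and +2 days
group by dateadd(dd, n, [Date])
having max(iif(n = 0, 1, 0)) = 1               -- keep only dates that exist in myTable
order by dateadd(dd, n, [Date]);
```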

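Thanks. Could you clarify which command I should run, assuming my table is called myTable? I currently get the error `SQL Server Database Error: "a" is not a recognized table hints option. If it is intended as a parameter to a table-valued function or to the CHANGETABLE function, ensure that your database compatibility mode is set to 90.` - RockScience
In the query above I used a common table expression as your table and filled it with sample data. You don't need that part. So you have to run the part starting with the SELECT statement and change cte to myTable. What is your version of SQL Server? - uzi
Many thanks to @Martin Smith for adding a description to my query. - uzi
Using CROSS APPLY is faster than using CROSS JOIN, so in cases like this one, where you don't really need to join data from different tables, change CROSS JOIN to CROSS APPLY. - rpd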

1
Use the GETDATE() function to get the current date, and DATEADD() to get the last 3 days.
 SELECT Date, count(DISTINCT item) 
 FROM myTable 
 WHERE [Date] >= DATEADD(day,-3, GETDATE())
 GROUP BY Date 
 ORDER BY Date

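Thanks, this gives me a single point. I would like to get it for each date. - RockScience
Sorry, what is wrong with my answer? Can you post some sample data and the result you need? - Juan Carlos Oropeza
  1. What is "day" in your query?
  2. Added an example to the question. I don't want the last 3 days; I want the unique items over a rolling 3 days for each date.
- RockScience
day is the unit you want to add or subtract; it could also be month or year. But it looks like, after adding the sample data and the bounty, you are getting better answers now. - Juan Carlos Oropeza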

1

SQL

SELECT DISTINCT Date,
       (SELECT COUNT(DISTINCT item)
        FROM myTable t2
        WHERE t2.Date BETWEEN DATEADD(day, -2, t1.Date) AND t1.Date) AS count
FROM myTable t1
ORDER BY Date;

Demo

Rextester demo: http://rextester.com/ZRDQ22190


1

Since COUNT(DISTINCT item) OVER (PARTITION BY [Date]) is not supported, you can use dense_rank to emulate it:

SELECT Date, dense_rank() over (partition by [Date] order by [item]) 
+ dense_rank() over (partition by [Date] order by [item] desc) 
- 1 as count_distinct_item
FROM myTable 

One thing to note is that dense_rank will count nulls, whereas COUNT will not.
Refer to this post for more details.
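If item can be NULL, one way to bring the emulation back in line with COUNT(DISTINCT item) is to subtract 1 for any partition that contains a NULL. The sketch below assumes the question's myTable([Date], item) and, like the query above, still counts per date rather than over a rolling window.

```sql
-- Sketch only: assumes the question's table myTable([Date], item).
SELECT [Date],
       dense_rank() over (partition by [Date] order by [item])
     + dense_rank() over (partition by [Date] order by [item] desc)
     - 1
       -- NULL forms its own rank group, so remove it when the partition has one
     - max(case when [item] is null then 1 else 0 end) over (partition by [Date])
       as count_distinct_item
FROM myTable;
```

A partition that contains only NULLs then yields 0, which is what COUNT(DISTINCT item) returns for it.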

0
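Here is a simple solution that uses myTable itself as the source of the grouping dates (edited for SQL Server dateadd). Note that this query assumes there is at least one record in myTable for every date; if any date is absent, it will not appear in the query results, even if there are records for the 2 days prior: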
select
    date,
    (select
        count(distinct item)
        from (select distinct date, item from myTable) as d2
     where
        d2.date between dateadd(day,-2,d.date) and d.date
    ) as count
from (select distinct date from myTable) as d
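If dates missing from myTable should still appear in the output, one option is to drive the query from a generated calendar instead of from the table's own dates. The following is a sketch under assumptions rather than part of the original answer: the bounds and calendar CTE names are made up for illustration, and the range runs from the earliest to the latest date found in myTable.

```sql
-- Sketch only: assumes the question's table myTable([Date], item).
with bounds as (
    select min(cast([Date] as date)) as first_day
         , max(cast([Date] as date)) as last_day
    from myTable
), calendar as (
    -- one row per day between the first and last date in the table
    select first_day as d, last_day from bounds
    union all
    select dateadd(day, 1, d), last_day
    from calendar
    where d < last_day
)
select
    c.d as [Date],
    (select count(distinct t.item)
       from myTable t
      where t.[Date] >= dateadd(day, -2, c.d)
        and t.[Date] <  dateadd(day, 1, c.d)   -- covers the whole of day c.d
    ) as [count]
from calendar c
order by c.d
option (maxrecursion 0);   -- lift the default limit of 100 recursion levels
```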

0

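I solved this problem with math.

z (any day) = 3x + y (y is the value mod 3). I need the days from 3 * (x - 1) + y + 1 to 3 * (x - 1) + y + 3.

3 * (x - 1) + y + 1 = 3 * (z / 3 - 1) + z % 3 + 1

In that case, I can use group by (between 3 * (z / 3 - 1) + z % 3 + 1 and z)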
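As a quick check of the arithmetic above, assuming / is integer division (as it is for int operands in T-SQL), so that z = 3 * (z / 3) + z % 3, the lower bound of the range simplifies as follows:

```latex
\begin{align*}
3\left(\left\lfloor z/3 \right\rfloor - 1\right) + (z \bmod 3) + 1
  &= 3\left\lfloor z/3 \right\rfloor + (z \bmod 3) - 3 + 1 \\
  &= z - 2
\end{align*}
```

So the range described runs from day z - 2 to day z, which is a 3-day window ending on z.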

    SELECT  iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  3 * (cast(OrderDate as int) / 3 - 1) + (cast(OrderDate as int) % 3) + 1 
and orderdate, Orderdate, 0)

If you need a different day grouping, you can use:

declare @n int = 4; -- another day count

SELECT  iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
, count(sh.SalesOrderID) FROM Sales.SalesOrderDetail shd
JOIN Sales.SalesOrderHeader sh on sh.SalesOrderID = shd.SalesOrderID
group by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
order by iif(OrderDate between  @n * (cast(OrderDate as int) / @n - 1) + (cast(OrderDate as int) % @n) + 1 
and orderdate, Orderdate, 0)
