在SQL中,删除相邻重复的行并进行时间计算。

4
在Microsoft SQL Server 2012中,我需要删除下面“Flow”列中相邻重复的行,只保留第一行(用*标记以说明)。之后,我需要计算所有行的1和0之间的时间差,并得到总的累积时间。
Record Number    Downhole Time      Flow
-------------------------------------------
0        03/27/2013 19:23:48.582    1       *
58       03/27/2013 19:28:12.606    1   
137      03/27/2013 19:32:16.070    0       *
143      03/27/2013 19:33:59.070    0   
255      03/27/2013 19:40:14.070    0   
272      03/29/2013 14:43:55.071    1       *
289      03/29/2013 14:45:44.070    1   
293      03/29/2013 14:45:59.071    0       *
294      03/29/2013 14:46:10.070    0   

删除相邻记录后的结果:

Record Number    Downhole Time      Flow
-------------------------------------------
0        03/27/2013 19:23:48.582    1       *
137      03/27/2013 19:32:16.070    0       *
272      03/29/2013 14:43:55.071    1       *
293      03/29/2013 14:45:59.071    0       *

期望最终结果

cumulative time difference = 
  (03/27/2013 19:32:16.070 - 03/27/2013 19:23:48.582) 
+ (03/29/2013 14:45:59.071 - 03/29/2013 14:43:55.071) 
+ if there are more rows.

这是哪种数据库(SQL Server、MySQL等)? - Martin
“first one” 是指最低的 Unique 还是最早的 Time? -- “time difference for all rows, each consecutive pair” 指的是所有行之间的时间差异,每个连续的一对是什么意思?你能举个例子吗? - Daniel Sparing
@DanielSparing 我已经编辑并添加了示例步骤。 - Pavan
我已经帮你开始格式化查询了。请完成它,以便其他人可以阅读! - ErikE
1
请将你的解决方案(答案)从问题中移除并在下面作为回答发表。同时,您可以接受它(使用左侧的绿色勾选标记)。谢谢! - abatishchev
显示剩余3条评论
3个回答

1
我相信这个可以满足您的要求:

WITH FlowIntervals AS (
   SELECT
      FromTime = Min(D.[Downhole Time]),
      X.ToTime
   FROM
      dbo.vLog D
      OUTER APPLY (
         SELECT TOP 1 ToTime = D2.[Downhole Time]
         FROM dbo.vLog D2
         WHERE
            D.[Downhole Time] < D2.[Downhole Time]
            AND D.[Flow] <> D2.[Flow]
         ORDER BY D2.[Downhole Time]
      ) X
   WHERE D.Flow = 1
   GROUP BY X.ToTime
)
SELECT Sum(DateDiff(ms, FromTime, IsNull(ToTime, GetDate())) / 1000.0)
FROM FlowIntervals
;

这个查询适用于SQL 2005及以上版本。它的性能表现不错,但需要对vLog表进行自连接,因此可能无法像使用LEADLAG的解决方案那样表现得更好。
如果您正在寻求绝对最佳的性能,这个查询可能可以胜任:
WITH Ranks AS (
   SELECT
      Grp =
         Row_Number() OVER (ORDER BY [Downhole Time])
         - Row_Number() OVER (PARTITION BY Flow ORDER BY [Downhole Time]),
      [Downhole Time],
      Flow
   FROM dbo.vLog
), Ranges AS (
   SELECT
      Result = Row_Number() OVER (ORDER BY Min(R.[Downhole Time]), X.Num) / 2,
      [Downhole Time] = Min(R.[Downhole Time]),
      R.Flow, X.Num
   FROM
      Ranks R
      CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Num)
   GROUP BY
      R.Flow, R.Grp, X.Num
), FlowStates AS (
   SELECT
      FromTime = Min([Downhole Time]),
      ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Downhole Time]) END,
      Flow = IsNull(Min(CASE WHEN Num = 2 THEN Flow ELSE NULL END), Min(Flow))
   FROM Ranges R
   WHERE Result > 0
   GROUP BY Result
)
SELECT
   ElapsedSeconds =
      Sum(DateDiff(ms, FromTime, IsNull(ToTime, GetDate())) / 1000.0)
FROM
   FlowStates
WHERE
   Flow = 1
;

使用您提供的示例数据,它返回 631.486000(秒)。如果只选择来自FlowStates CTE的行,则会得到以下结果:
FromTime                ToTime                  Flow
----------------------- ----------------------- ----
2013-03-27 19:23:48.583 2013-03-27 19:32:16.070 1
2013-03-27 19:32:16.070 2013-03-29 14:43:55.070 0
2013-03-29 14:43:55.070 2013-03-29 14:45:59.070 1
2013-03-29 14:45:59.070 NULL                    0

这个查询适用于SQL 2005及以上版本,性能非常好,可以与任何其他解决方案(包括使用LEADLAG的解决方案)相媲美,它通过一种巧妙的方式模拟了LEADLAG。我不能保证它一定会胜出,但它可能表现得非常出色,最终可能会胜出。

有关查询详细信息,请参见类似问题的答案

最后,对于SQL Server,这里是一个完整的Lag/Lead版本:

WITH StateChanges AS (
   SELECT
      [Downhole Time],
      Flow,
      Lag(Flow) OVER (ORDER BY [Downhole Time]) PrevFlow
   FROM
      dbo.vLog
), Durations AS (
   SELECT
      [Downhole Time], 
      Lead([Downhole Time]) OVER (ORDER BY [Downhole Time]) NextTime,
      Flow
   FROM
      StateChanges
   WHERE
      Flow <> PrevFlow
      OR PrevFlow IS NULL
)
SELECT ElapsedTime = Sum(DateDiff(ms, [Downhole Time], NextTime) / 1000.0)
FROM Durations
WHERE Flow = 1
;

这个查询需要 SQL Server 2012 或更高版本。它计算状态的变化(流程是否改变?),然后选择那些流程确实发生了变化的记录,最后计算从0到1(流程开始时)流程发生变化的时间段的持续时间。
我很想看看你对此查询的I/O和时间的实际性能结果与其他查询相比如何。如果仅查看执行计划,这个查询似乎会胜出 - 但在实际性能统计上可能并不是如此明显的赢家。

太棒了,谢谢@ErikE。你上次的查询用了5秒钟,我的查询用了8秒钟。 - Pavan
实际上,我的查询需要NoDuplicatesFlowTable部分。如果我将其删除,我们的性能就会匹配。你的查询在这个数据集中运气很好(存在误报风险)。 - Pavan
1
请您能否解释一下我的查询无法处理的更棘手的数据?另外,其他查询表现如何?表上有哪些索引?顺便说一下,我还犯了一个错误,不小心使用了“[Record Number]”而不是“[Downhole Time]”,现在已经修复了。 - ErikE

0

以下是我最终解决问题的方法。

WITH FlowEvntTable AS (
  /* the following gets raw data and adds Row# for the next select to use*/
  SELECT        ROW_NUMBER() OVER (ORDER BY [Downhole Time]) AS RNum,
                [Downhole Time],
                [Record Number],
                Value AS Flow
  FROM          [newMDF].[dbo].[vLog]
  WHERE         [Event Name] LIKE 'Flow%'
    AND         [Field Name] LIKE 'Flow'
), 
NoDuplicatesFlowTable AS (
  /*the following line came from StackOverflow "ignore adjacent matching rows" */
  SELECT        [Downhole Time],
                [Flow]
  FROM          FlowEvntTable A 
  WHERE         A.RNum NOT IN (
    SELECT        A.RNUM 
    FROM          FlowEvntTable A 
    JOIN          FlowEvntTable B 
       ON         B.RNum +1 = A.RNum 
       AND        B.Flow=A.Flow
  ) 
), 
FlowOffColAddedTable AS (
  SELECT        *, 
                lead([Downhole Time]) 
  OVER          (ORDER BY [Downhole Time]) AS NotFlowTime 
  FROM          NoDuplicatesFlowTable 
), 
FlowStartEndTimeTable AS (
  /*select above adds time offest by 1 row to a new column. now by Flow = 1/On, you get Start End On pairs */
  SELECT        [Downhole Time] AS StartTime,
                NotFlowTime AS EndTime 
  FROM          FlowOffColAddedTable 
  WHERE         Flow = 1
)

/*diff and sum the pairs*/
SELECT          Sum(DATEDIFF(ms,StartTime,EndTime))/1000 AS VibeOnSec 
FROM            FlowStartEndTimeTable 

在上面的Select *, lead...之后的中间结果。

Downhole Time         Flow  NotFlowTime  
-------------------------------------------  
2013-03-28 00:23:48.0000000 1   2013-03-28 00:32:16.0000000  
2013-03-28 00:32:16.0000000 0   2013-03-28 00:33:59.0000000  
2013-03-28 00:33:59.0000000 1   2013-03-28 00:40:14.0000000  
2013-03-28 00:40:14.0000000 0   2013-03-29 19:43:55.0000000  
2013-03-29 19:43:55.0000000 1   2013-03-29 19:45:44.0000000  

注意:不,这并不匹配原始数据集——它只是旨在给出大致的结果。

0

不确定您使用的是哪种数据库。以下是一种使用分析函数和 Oracle 的解决方案:

SELECT 
  un, 
  mytime, 
  flow,
  lead (mytime) OVER (ORDER BY UN) lead_time,
 (lead (mytime) OVER (ORDER BY UN) - mytime)*24*60 minutes
  FROM (  SELECT un,
                 mytime,
                 flow,
                 LAG (flow) OVER (ORDER BY UN) lag_val
            FROM test
        ORDER BY un) a
 WHERE a.flow != NVL (a.lag_val, 9999)

内部查询使用LAG分析函数获取先前流的值。外部选择的where子句过滤“重复”的流程(仅留下更改的rist出现)。外部选择还使用LEAD分析函数计算时间差异(以分钟为单位)。尽管您拥有大量数据,但从性能方面来说,这将是非常好的。让我知道您使用的数据库类型 - 大多数数据库都有分析函数实现(或解决方法)... 这只适用于Orace。

我需要用“From(Select ....)”替换你的“From test”。在最后一个order by中,我得到了以下错误:“在视图、内联函数、派生表、子查询中无法使用ORDER BY子句...除非也指定TOP、OFFSET或FOR XML。”救命! - Pavan
选择 [记录编号], [井下时间], 流量, lead ([井下时间]) OVER (ORDER BY [记录编号]) lead_time, (lead ([井下时间]) OVER (ORDER BY [记录编号]) - [井下时间])2460 分钟 FROM ( SELECT [记录编号], [井下时间], 流量, LAG (流量) OVER (ORDER BY [记录编号]) lag_val FROM ( Select [井下时间], [记录编号], 值 As 流量 FROM [newMDF].[dbo].[vLog] where [事件名称] like 'Flow%' AND [字段名称] like 'Flow' ) Test_table ORDER BY [记录编号]) a WHERE ... - Pavan
我在你的第6行代码中遇到了以下错误:数据类型datetime2无效。我暂时移除了order by,并用ISNULL替换了NVL。 - Pavan
嗯,这很奇怪。我不知道SQL Server不允许在内部选择中使用order by。这可能会破坏滞后函数...尝试从内部选择中删除order by子句-它将解决该错误。希望SQL Server也会将分析函数的order by子句应用于实际选择。让我看看如何在SQL Server中获取两个datetime列之间的时间差异。在Oracle中,您只需减去日期值,但显然在SQL Server中有所不同... - Koshera
尝试使用datediff而不是减法运算符。像这样的东西应该给你两者之间的时间差(以分钟为单位):DateDiff(s,lead([Downhole Time]) OVER (ORDER BY [RECORD NUMBER],[Downhole Time])%3600/60)。不确定您是否能够在datediff中使用lead函数作为参数。如果不能,则必须添加一个更外层的选择并在那里执行datediff... - Koshera
显示剩余2条评论

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接