从分组重复记录中选择最近日期的最佳方法

3

I have table like this:

CREATE TABLE [dbo].[TestToDelete](
    [id] [int] NULL,
    [Email] [nvarchar](50) NULL,
    [RawEmail] [nvarchar](50) NULL,
    [Status] [tinyint] NULL,
    [ValidationDate] [datetime] NULL
) ON [PRIMARY]

INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (1, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:30.300' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (2, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:52.347' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (3, N'a@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 14:00:58.117' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (4, N'a@a.ru', N'aaa@a.ru', 22, CAST(N'2017-02-07 14:01:08.360' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (5, N'b@b.ru', N'bbb@b.ru', 11, CAST(N'2017-02-07 14:01:21.783' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (6, N'b@b.ru', N'bbb@b.ru', 11, CAST(N'2017-02-07 14:01:29.310' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (7, N'b@b.ru', N'bbb@b.ru', 22, CAST(N'2017-02-07 14:01:37.050' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (8, NULL, N'bbb@b.ru', 0, CAST(N'2017-02-07 14:02:10.643' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (9, NULL, N'aaa@a.ru', 0, CAST(N'2017-02-07 14:02:22.160' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (10, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:01.637' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (11, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:06.657' AS DateTime))
INSERT [dbo].[TestToDelete] ([id], [Email], [RawEmail], [Status], [ValidationDate]) VALUES (12, N'anew@a.ru', N'aaa@a.ru', 11, CAST(N'2017-02-07 15:30:12.160' AS DateTime))

我需要选择出现3次或更多且日期最近的记录(EmailRawEmailStatus字段)。在这个表格中,它是:

'anew@a.ru | aaa@a.ru | 11'

而不是

'a@a.ru | aaa@a.ru | 11'

因为anew@a.ru的日期更近。

执行此选择的查询:

select * from
(
   select email, rawEmail, Status, 
   ROW_NUMBER() OVER(PARTITION BY rawEmail ORDER BY vdate DESC) num 
   from
      (select max([ValidationDate]) vdate, email, rawEmail, Status
         from TestToDelete where status in (11, 22)
         group by rawEmail, email, status 
         having count(*) > 2
      ) tmp 
)final where num = 1

有没有可能使用更少的子查询完成这个任务(而不是像现在这样使用3个)?


更新: 当出现3个或更多时,期望输出:

anew@a.ru | aaa@a.ru | 11

当有2个或更多出现时,期望的输出为:

anew@a.ru | aaa@a.ru | 11
b@b.ru | bbb@b.ru | 11

您的意思是最新日期时间有3条或更多记录吗? - Vamsi Prabhala
@vkp 三个或更多相同记录之一具有最新的日期时间。 - Andrey
@vkp 如果一个元组包含了2010年、2011年和2016年的日期,而另一个元组包含了1997年、1998年和2017年的日期——我需要后者,因为2017年更近。 - Andrey
预期输出是一条记录还是总共3条?
  1. Python is a popular programming language for data analysis and machine learning.
  2. JavaScript is commonly used for front-end web development.
  3. Java is a widely-used programming language for developing mobile applications, web applications, and games.
预期输出:
  1. Python 是数据分析和机器学习常用的流行编程语言。
  2. JavaScript 常用于前端网页开发。
  3. Java 是广泛使用的编程语言,用于开发移动应用程序、Web 应用程序和游戏。
- TheGameiswar
1
你已经接近了最优解。你需要对一组子集进行排名,并选择排名最高的行。由于你不能在一个查询中同时执行这两个操作,所以你至少需要两个查询。 - Ross Bush
@TheGameiswar,我已经更新了帖子。预期输出是不同的元组,但如果这些元组具有相同的RawEmail,则只需要具有最新日期的那个元组。 - Andrey
4个回答

2
使用 with (common table expression)row_number()count() over()
如果我们可以按照 RawEmail, Statuscount() 进行分区,则:
;with cte as (
    select
        rn = row_number() over (
            partition by RawEmail
            order by ValidationDate desc
            )
        , cnt = count(*) over (
            partition by RawEmail, status
            )
        , *
    from TestToDelete
    where status in (11, 22)
    )
select * 
from cte o 
where o.rn=1 
  and o.cnt > 2

results: http://rextester.com/WYVZ86149

+----+-----+----+-----------+----------+--------+---------------------+
| rn | cnt | id |   Email   | RawEmail | Status |   ValidationDate    |
+----+-----+----+-----------+----------+--------+---------------------+
|  1 |   7 | 12 | anew@a.ru | aaa@a.ru |     11 | 07.02.2017 15:30:12 |
+----+-----+----+-----------+----------+--------+---------------------+

如果我们无法按照 RawEmail, Status 分区 count(),那么:
;with cte as (
    select
        rn = row_number() over (
            partition by RawEmail
            order by ValidationDate desc
            )
        , cnt = count(*) over (
            partition by RawEmail
            )
        , *
    from TestToDelete
    where status in (11, 22)
    )
select * 
from cte o 
where o.rn=1 
  and o.cnt > 2
  and exists (
    select 1
      from cte i 
      where i.RawEmail = o.RawEmail
        and i.Email != o.Email
      )

结果: http://rextester.com/YTQ30810

+----+-----+----+-----------+----------+--------+---------------------+
| rn | cnt | id |   Email   | RawEmail | Status |   ValidationDate    |
+----+-----+----+-----------+----------+--------+---------------------+
|  1 |   7 | 12 | anew@a.ru | aaa@a.ru |     11 | 07.02.2017 15:30:12 |
+----+-----+----+-----------+----------+--------+---------------------+

非常感谢。第一个查询正是我想要的。 - Andrey

1

试试这个:

;WITH CTE AS (
   SELECT *,
          RANK() OVER (PARTITION BY RawEmail
                       ORDER BY ValidationDate DESC) AS rn,
          COUNT(*) OVER (PARTITION BY Email, Status) AS cnt
   FROM TestToDelete
) 
SELECT *
FROM CTE 
WHERE rn = 1 AND cnt >= 3
ORDER BY ValidationDate DESC

这个查询使用了一个采用窗口函数的通用表达式:

  • RANK被用来获取每个RawEmail记录中最新的记录或记录(如果存在并列)
  • COUNT被用来确定每个Email, Status片段的人数

它只返回前1个。我需要所有出现3次或更多次的元组。但是,如果这些元组具有相同的“RawEmail” - 我想要具有最近日期的元组。 - Andrey

0

我相信这会满足你的需求,而不需要使用CTE或ROW_NUMBER()。

SELECT TOP 1 email, rawEmail, Status
FROM TestToDelete 
WHERE status IN (11, 22)
GROUP BYrawEmail, email, status 
HAVING COUNT(*) > 2
ORDER BY MAX(validationDate) DESC

但是top 1只返回一行。我需要所有出现3次或更多次的行,如果有两个相同的元组,我想要最近的那个。 - Andrey
你需要编辑你的问题。它只显示返回了1条记录。 - TTeeple

0

如果我理解正确,您当前的查询并没有实现您所表达的功能。您需要考虑日期因素。为了获取在最近日期有3个相同状态记录的行:

select ttd.*
from (select ttd.*,
             count(*) over (partition by email, rawemail, cast(ValidationDate as date)) as cnt,
             rank() over (partition by email, rawemail order by cast(ValidationDate as date) desc) as seqnum
      from TestToDelete ttd
      where status in ('11', '22')
     ) ttd
where seqnum = 1 and cnt >= 3;

如果您只想要电子邮件和状态,请使用:
select distinct email, rawemail, status

编辑:

我想到了您想知道最近三条记录是否都具有相同状态。这更容易:

select email, rawemail, max(status)
from (select ttd.*,
             row_number() over (partition by email, rawemail         
                                 order by ValidationDate desc) as seqnum
      from TestToDelete ttd
      where status in ('11', '22')
     ) ttd
where seqnum <= 3
group by email, rawemail
having min(status) = max(status)

我的当前查询正好符合我的要求。但是我想尽量减少子查询的数量。 - Andrey

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接