Extracting an email address from a string using T-SQL

8

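I'm trying to extract email addresses from an existing comments field and put them into their own column. The string might look like "this is an example comment with some email address someemail@domain.org", or it might be just the email itself: "someemail@domain.org".

I figure the best approach is to find the index of the '@' symbol and search in both directions until I either hit the end of the string or reach a space. Can anyone help me implement this?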


1
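I would use PATINDEX to find the start of the email address. Search the web for email address patterns: you will find everything from the simplest to the most complex, some of which SQL Server may not even recognize. Then I would use CHARINDEX to locate the next space, or fall back to the end of the string if CHARINDEX returns 0. - cha
8 Answers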

11

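I know wewesthemenace already answered the question, but his/her solution seems overly complicated. Why concatenate the left and right sides of the email address? I'd rather find the beginning and the end of the email address and then use SUBSTRING to return it, like so:

My table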

DECLARE @Table TABLE (comment NVARCHAR(50));
INSERT INTO @Table
VALUES ('blah MyEmailAddress@domain.org'),            --At the end
        ('blah MyEmailAddress@domain.org blah blah'), --In the middle
        ('MyEmailAddress@domain.org blah'),           --At the beginning
        ('no email');

Actual query:

SELECT  comment,        
        CASE
            WHEN CHARINDEX('@',comment) = 0 THEN NULL
            ELSE SUBSTRING(comment,beginningOfEmail,endOfEmail-beginningOfEmail)
        END email
FROM @Table
CROSS APPLY (SELECT CHARINDEX(' ',comment + ' ',CHARINDEX('@',comment))) AS A(endOfEmail)
CROSS APPLY (SELECT DATALENGTH(comment)/2 - CHARINDEX(' ',REVERSE(' ' + comment),CHARINDEX('@',REVERSE(' ' + comment))) + 2) AS B(beginningOfEmail)

Results:

comment                                            email
-------------------------------------------------- --------------------------------------------------
blah MyEmailAddress@domain.org                     MyEmailAddress@domain.org
blah MyEmailAddress@domain.org blah blah           MyEmailAddress@domain.org
MyEmailAddress@domain.org blah                     MyEmailAddress@domain.org
no email                                           NULL

1
When I use this, it seems to throw an "Invalid length parameter passed to the LEFT or SUBSTRING function" exception. - Anthony
2
The data type in my table is nvarchar, so I changed DATALENGTH to LEN and it worked. Thanks. - Anthony
1
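LEN() ignores trailing spaces. DATALENGTH() does not, but it works a little differently: it returns the number of bytes. So for VARCHAR (non-Unicode), bytes = string length. For NVARCHAR (Unicode), you need to divide DATALENGTH() by 2. - Stephan
Worked great for me and is much more concise than the accepted solution, which also produced wrong results on my data. I also learned something new: LEN does not count trailing spaces, which I didn't know! - Shaun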
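
The LEN/DATALENGTH difference discussed in these comments can be seen with a small sketch. The variable names are made up for illustration, and the byte counts in the comments assume a single-byte VARCHAR collation and no supplementary characters in the NVARCHAR value:

```sql
-- Hypothetical sample values: three letters followed by two trailing spaces
DECLARE @v VARCHAR(20)  = 'abc  ';
DECLARE @n NVARCHAR(20) = N'abc  ';

SELECT LEN(@v)            AS len_varchar,     -- 3  (trailing spaces ignored)
       DATALENGTH(@v)     AS bytes_varchar,   -- 5  (1 byte per character)
       LEN(@n)            AS len_nvarchar,    -- 3
       DATALENGTH(@n)     AS bytes_nvarchar,  -- 10 (2 bytes per character)
       DATALENGTH(@n) / 2 AS chars_nvarchar;  -- 5  (trailing spaces counted)
```

This is why the query above divides DATALENGTH by 2 for an NVARCHAR column: it gets a character count that still includes trailing spaces.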

7
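You can search the string for '@', then take the strings to the left and to the right of it. Next, reverse the left string, find the first occurrence of a space, and take the SUBSTRING up to that point. Then reverse it back to restore the original order. The same principle applies to the right side, except that no REVERSE is needed.
Example string: 'some text someemail@domain.org some text'
  1. LEFT = 'some text someemail'
  2. RIGHT = '@domain.org some text'
  3. Reverse LEFT = 'liameemos txet emos'
  4. SUBSTRING up to the first space = 'liameemos'
  5. REVERSE(4) = 'someemail'
  6. SUBSTRING (2) up to the first space = '@domain.org'
  7. Combine steps 5 and 6 = 'someemail@domain.org'
The query: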
;WITH CteEmail(email) AS(
    SELECT 'someemail@domain.org' UNION ALL
    SELECT 'some text someemail@domain.org some text' UNION ALL
    SELECT 'no email'
)
,CteStrings AS(
    SELECT
        [Left] = LEFT(email, CHARINDEX('@', email, 0) - 1),
        Reverse_Left = REVERSE(LEFT(email, CHARINDEX('@', email, 0) - 1)),
        -- Everything from '@' to the end: the length must be LEN(email) - position + 1
        [Right] = RIGHT(email, LEN(email) - CHARINDEX('@', email, 0) + 1)
    FROM CteEmail
    WHERE email LIKE '%@%'
)
SELECT *,
    REVERSE(
        SUBSTRING(Reverse_Left, 0, 
            CASE
                WHEN CHARINDEX(' ', Reverse_Left, 0) = 0 THEN LEN(Reverse_Left) + 1
                ELSE CHARINDEX(' ', Reverse_Left, 0)
            END
        )
    )
    +
    SUBSTRING([Right], 0,
        CASE
            WHEN CHARINDEX(' ', [Right], 0) = 0 THEN LEN([Right]) + 1
            ELSE CHARINDEX(' ', [Right], 0)
        END
    )
FROM CteStrings

Sample data:

email
----------------------------------------
someemail@domain.org
some text someemail@domain.org some text
no email

Result:

---------------------
someemail@domain.org
someemail@domain.org

Be sure to read up on the SUBSTRING, LEFT, and RIGHT functions. - Felix Pamittan
1
There is a problem with the Right column. It should include LEN: [Right] = RIGHT(email, LEN(email) - CHARINDEX('@', email, 0) + 1) - andres descalzo
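
To illustrate the comment above: passing CHARINDEX('@', ...) + 1 as the length to RIGHT is only correct when the string is exactly twice as long as the position of '@', which both sample rows happen to satisfy (lengths 20 and 40, with '@' at positions 10 and 20). A minimal sketch with a made-up string that does not satisfy that condition:

```sql
-- Hypothetical input: 22 characters long, '@' at position 12
DECLARE @email VARCHAR(100) = 'x someemail@domain.org';

SELECT RIGHT(@email, CHARINDEX('@', @email) + 1)               AS without_len, -- 'il@domain.org'
       RIGHT(@email, LEN(@email) - CHARINDEX('@', @email) + 1) AS with_len;    -- '@domain.org'
```

Note that LEN ignores trailing spaces, so the LEN-based version assumes the input has none.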

6

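Stephan's answer works great for finding a single email address in each row.

However, when I tried to get multiple email addresses per row, I ran into the following error:

Invalid length parameter passed to the LEFT or SUBSTRING function

I used this answer from DBA Stack Exchange to get the positions of all the '@' characters in the string. It includes a table-valued function that returns the positions at which a given pattern occurs in a string. I also had to modify the CROSS APPLY expressions to handle multiple email addresses.

My table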

DECLARE @Table TABLE (comment VARCHAR(500));
INSERT INTO @Table (comment)
VALUES ('blah blah My.EmailAddress@domain.org more blah someemailaddress@domain.com even more blah asdf@gmail.com'),
       ('blah hello.world@domain.org more'),
       ('no email')

Table-valued function:

CREATE FUNCTION dbo.fnFindPatternLocation
(
    @string NVARCHAR(MAX),
    @term   NVARCHAR(255)
)
RETURNS TABLE
AS
    RETURN 
    (
        SELECT pos = Number - LEN(@term) 
        FROM (
            SELECT Number,
                   Item = LTRIM(RTRIM(SUBSTRING(@string, Number, 
                          CHARINDEX(@term, @string + @term, Number) - Number)))
            FROM (
                SELECT ROW_NUMBER() OVER (ORDER BY [object_id])
                FROM sys.all_objects
            ) AS n(Number)
            WHERE Number > 1 AND Number <= CONVERT(INT, LEN(@string))
              AND SUBSTRING(@term + @string, Number, LEN(@term)) = @term
        ) AS y
    );
GO

Query:

SELECT comment, pos, SUBSTRING(comment,beginningOfEmail,endOfEmail-beginningOfEmail) AS email
FROM @Table
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(comment, '@')) AS A(pos)
CROSS APPLY (SELECT CHARINDEX(' ',comment + ' ', pos)) AS B(endOfEmail)
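-- A space is prepended so that an address at the very start of the comment still has a left boundary
CROSS APPLY (SELECT pos - CHARINDEX(' ', REVERSE(' ' + SUBSTRING(comment, 1, pos))) + 2) AS C(beginningOfEmail)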

Results:

comment
---------------------------------------------------------------------------------------------------------
blah blah My.EmailAddress@domain.org more blah someemailaddress@domain.com even more blah asdf@gmail.com
blah blah My.EmailAddress@domain.org more blah someemailaddress@domain.com even more blah asdf@gmail.com
blah blah My.EmailAddress@domain.org more blah someemailaddress@domain.com even more blah asdf@gmail.com
blah hello.world@domain.org more

pos    email
---    ------------------------------
26     My.EmailAddress@domain.org
64     someemailaddress@domain.com
95     asdf@gmail.com
17     hello.world@domain.org

Very helpful. I added some PATINDEX clauses to avoid matching things like "make sure @mesomething" or other inaccurate content. - Nick Oetjen
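
The PATINDEX clauses mentioned in this comment are not shown in the thread. As a rough sketch of the idea only, assuming the extracted tokens are available in a hypothetical table variable @Candidates, one could require at least one character before the '@' and a dot somewhere after it:

```sql
-- Hypothetical candidate tokens, as they might come out of the extraction query
DECLARE @Candidates TABLE (token VARCHAR(100));
INSERT INTO @Candidates (token)
VALUES ('asdf@gmail.com'),   -- plausible address
       ('@mesomething'),     -- nothing before the '@'
       ('user@host');        -- no dot after the '@'

-- '_' matches exactly one character and '%' any run of characters,
-- so the pattern means: something, '@', something, '.', something
SELECT token
FROM @Candidates
WHERE PATINDEX('%_@_%._%', token) > 0;   -- returns only 'asdf@gmail.com'
```

This is a coarse plausibility filter, not a full email validator.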

3
DECLARE @t TABLE (row_id INT, email VARCHAR(100))

INSERT @t (row_id, email)
VALUES (1, 'drgkls<ivan@gvi.ru>, info@gvi.com, @ dgh507-16-65@'),
        (2, 'hjshfkjshfj@kjs.kjsehf herwfjewr@kjsd.com adjfhja@.com u3483dhj@hhb@.dfj'),
        (3, 'kjsdghfjs4254.23detygh@jhjdfg.dgb лдоврывплдоо isgfsi@ klsdfksdl@,dd.com')

DECLARE @pat VARCHAR(100) = '%[^a-z0-9@._ ]%';

WITH f AS (
         SELECT    row_id,
                 CAST(' ' + email + ' ' AS VARCHAR(102)) email,
                 SUBSTRING(email, PATINDEX(@pat, email), 1) bad,
                 PATINDEX(@pat, email) pat
         FROM    @t
         UNION ALL
         SELECT    row_id,
                 CAST(REPLACE(email, bad, ' ') AS VARCHAR(102)),
                 SUBSTRING(REPLACE(email, bad, ' '), PATINDEX(@pat, REPLACE(email, bad, ' ')), 1) bad,
                 PATINDEX(@pat, REPLACE(email, bad, ' '))
         FROM    f
         WHERE    PATINDEX(@pat, email) > 0
     ),
     s AS 
     (
         SELECT    row_id,
                 email, PATINDEX('%@%', email) pos 
         FROM    f 
         WHERE    pat = 0
                 AND    PATINDEX('%@%', email) > 0
         UNION ALL
         SELECT    row_id,
                 SUBSTRING(email, pos + 1, 102), 
                 PATINDEX('%@%', SUBSTRING(email, pos + 1, 102))
         FROM    s
         WHERE    PATINDEX('%@%', SUBSTRING(email, pos + 1, 102)) > 0
     )

SELECT  row_id, o1 + pp
FROM    s   
        CROSS APPLY (SELECT    REVERSE(LEFT(email, pos -1)) s1) x
        CROSS APPLY (SELECT    CHARINDEX(' ', s1) i1) y
        CROSS APPLY (SELECT    REVERSE(LEFT(s1, i1 -1)) o1 WHERE i1 > 0) z
        CROSS APPLY (SELECT    CHARINDEX(' ', email, pos) i2) e
        CROSS APPLY (SELECT    SUBSTRING(email, pos, i2 -pos) pp WHERE    i2 > pos + 1) q
WHERE    LEN(o1) > 1
        AND CHARINDEX('.', pp) > 0
        AND PATINDEX('%@%@%', pp) = 0
        AND PATINDEX('%@.%', pp) = 0
        AND PATINDEX('%.', pp) = 0

1
If you need this in a function, this works for me...
CREATE FUNCTION [dbo].[extractEmail]
(
    @input nvarchar(500)
)
RETURNS nvarchar(100)
AS
BEGIN
    DECLARE @atPosition int
    DECLARE @firstRelevantSpace int
    DECLARE @name nvarchar(100)
    DECLARE @secondRelelvantSpace int
    DECLARE @everythingAfterAt nvarchar(500)
    DECLARE @domain nvarchar(100)
    DECLARE @email nvarchar(100) = ''
    IF CHARINDEX('@', @input,0) > 0
    BEGIN
        SET @input = ' ' + @input
        SET @atPosition = CHARINDEX('@', @input, 0)
        SET @firstRelevantSpace = CHARINDEX(' ',REVERSE(LEFT(@input, CHARINDEX('@', @input, 0) - 1)))
        SET @name = REVERSE(LEFT(REVERSE(LEFT(@input, @atPosition - 1)),@firstRelevantSpace-1))
        SET @everythingAfterAt = SUBSTRING(@input, @atPosition,len(@input)-@atPosition+1)
        SET @secondRelelvantSpace = CHARINDEX(' ',@everythingAfterAt)
        IF @secondRelelvantSpace = 0
            SET @domain = @everythingAfterAt
        ELSE
            SET @domain = LEFT(@everythingAfterAt, @secondRelelvantSpace - 1)  -- stop before the space
        SET @email = @name + @domain
    END
    RETURN @email
END

0

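For strings that contain line breaks, I modified Felix's answer to use PATINDEX to search for the first control character instead of a space.

I also had to modify the Right field so that it subtracts the correct amount of text.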

    WITH CteEmail(email) AS(
        SELECT 'example string with new lines

    Email: some.example@email.address.com
(first email address - should be returned)

    Email: another@test.co.uk
(other email addresses should be ignored

more example text' UNION ALL
        SELECT 'Email: some.example@email.address.com' UNION ALL
        SELECT 'someemail@domain.org' UNION ALL
        SELECT 'some text someemail@domain.org some text' UNION ALL
        SELECT 'no email'
    )
    ,CteStrings AS(
        SELECT
            [Left] = LEFT(email, CHARINDEX('@', email, 0) - 1),
            Reverse_Left = REVERSE(LEFT(email, CHARINDEX('@', email, 0) - 1)),
            [Right] = RIGHT(email, LEN(email) - CHARINDEX('@', email, 0) + 1 )
        FROM CteEmail
        WHERE email LIKE '%@%'
    )
    SELECT *,
        REVERSE(
            SUBSTRING(Reverse_Left, 0, 
                CASE
                    WHEN PATINDEX('%[' + CHAR(0)+'- ]%', Reverse_Left) = 0 THEN LEN(Reverse_Left) + 1
                    ELSE PATINDEX('%[' + CHAR(0)+'- ]%', Reverse_Left)
                END
            )
        )
        +
        SUBSTRING([Right], 0,
            CASE
                WHEN PATINDEX('%[' + CHAR(0)+'- ]%', [Right]) = 0 THEN LEN([Right]) + 1
                ELSE PATINDEX('%[' + CHAR(0)+'- ]%', [Right])
            END
        )
    FROM CteStrings

0
This one-liner works too (though it is a bit long, haha):
declare @a varchar(100) 
set @a = 'a asfd saasd asdfgh@asd.com wqe z zx cxzc '
select substring(substring(@a,0,charindex('@',@a)),len(substring(@a,0,charindex('@',@a)))-charindex(' ',reverse(substring(@a,0,charindex('@',@a))))+2,len(substring(@a,0,charindex('@',@a)))) + substring(substring(@a,charindex('@',@a),len(@a)),0,charindex(' ',substring(@a,charindex('@',@a),len(@a))))

0
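Using Cymorg's function: I ran into a problem where my data contained CR/LF, which kept the function from working 100% of the time. It was hard to track down, because the function only occasionally returned incorrect results when used in a SELECT statement. If I copied the offending text from the query results and called the function on it as a quoted literal with a SQL PRINT, it worked fine. Unbelievable!
After many attempts, I used SQL REPLACE to replace CR/LF with spaces, and hooray! I'm a good guesser.
select dbo.extractEmail(replace(replace(MyColumn,CHAR(10),' '),CHAR(13),' ')) as AsYouWish from FacilityContacts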
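
A small sketch of why the CR/LF matters, using a made-up string: the function looks for a space to find where the address starts, and a line break is not a space, so the search runs past it to an earlier space.

```sql
-- Hypothetical input with a CR/LF directly in front of the address
DECLARE @s VARCHAR(100) = 'mail me:' + CHAR(13) + CHAR(10) + 'someemail@domain.org';
DECLARE @clean VARCHAR(100) = REPLACE(REPLACE(@s, CHAR(10), ' '), CHAR(13), ' ');

-- Distance from the '@' back to the nearest space on its left
SELECT CHARINDEX(' ', REVERSE(LEFT(@s,     CHARINDEX('@', @s)     - 1))) AS raw_offset,    -- 15
       CHARINDEX(' ', REVERSE(LEFT(@clean, CHARINDEX('@', @clean) - 1))) AS clean_offset;  -- 10
```

With the raw string, the extracted name would start at "me:" and include the line break. After the REPLACE it starts right at "someemail".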
