如何从一个文本(NVARCHAR(MAX))列中提取一个或多个URL?

6

我在我的表格中有一个数据列,其中每行可能会有零个、一个或多个URL以及其他文本。我想要将这些URL提取到一个新的数据集中,只包含这些URL。

为什么?因为我想将其中一些URL添加到我的数据库的屏蔽列表中,以防止垃圾邮件。

例如,我在数据列中有以下文本:

hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.

然后,我希望你能将文本中的所有网址提取出来:

httx://portugal-forex.com/
httx://phen375treatment.com/
httx://priligy2000.org/
And so on.

我真的不知道从哪里开始在SQL中执行这个操作。


你只需要获取主域名,例如httx://portugal-forex.com/,或者也可以是httx://portugal-forex.com/xxx?Page=2。 - Darka
主域名就足够了。 - Daniel Björk
1个回答

5

这里有一个例子。我要从“httx://”到第一个“/”搜索字符串:

无论如何,您都需要逐行进行。

将代码放入函数中

CREATE FUNCTION Temporary.getLinksFromText (@Tekstas NVARCHAR(MAX))
RETURNS @Data TABLE(TheLink NVARCHAR(500))
AS
BEGIN

    DECLARE @FirstIndexOfChar INT,
            @LastIndexOfChar INT,
            @LengthOfStringBetweenChars INT ,
            @String VARCHAR(MAX)

   SET @FirstIndexOfChar    = CHARINDEX('httx://',@Tekstas,0) 

    WHILE @FirstIndexOfChar > 0
    BEGIN

        SET @String = ''
        SET @LastIndexOfChar    = CHARINDEX('/',@Tekstas,@FirstIndexOfChar+7)
        SET @LengthOfStringBetweenChars = @LastIndexOfChar - @FirstIndexOfChar + 1

        SET @String = SUBSTRING(@Tekstas,@FirstIndexOfChar,@LengthOfStringBetweenChars)
        INSERT INTO @Data (TheLink) VALUES (@String);

        SET @Tekstas = SUBSTRING(@Tekstas, @LastIndexOfChar, LEN(@Tekstas))
        SET @FirstIndexOfChar = CHARINDEX('httx://',@Tekstas, 0) 

    END 

    RETURN
END

创建一些测试数据:

CREATE TABLE  #Data(weLink NVARCHAR(MAX));
INSERT INTO #Data VALUES 
('hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.'),
('hmaruqbtufcvdlfu, <a href="httx://portugal-forex.com/">Day forex signal strategy trading</a>, KzxiIIO, [url=httx://portugal-forex.com/]Forex Broker[/url], mtNZQDi, httx://portugal-forex.com/ The best forex broker, IBWlBzg, <a href="httx://phen375treatment.com/">Avantage inconveniant phen 375</a>, ApEuXTp, [url=httx://phen375treatment.com/]Phen375[/url], QDVLpSn, httx://phen375treatment.com/ Where to buy phen 375, Fnwpugj, <a href="httx://priligy2000.org/">Priligy t</a>, zwRZhIC, [url=httx://priligy2000.org/]Order priligy[/url], FBgSaWs, httx://priligy2000.org/ Priligy buy online, FsemWnW, <a href="httx://ossorio.org/">Online Casino</a>, aOBtTaK, [url=httx://ossorio.org/]Online Casino[/url], oMMMacf, httx://ossorio.org/ Free online casino bounuses, occFLyZ, <a href="httx://paroxetine247.com/">Paroxetine adema</a>, xvrIdnq, [url=httx://paroxetine247.com/]Paroxetine depression[/url], MLSRAXX, httx://paroxetine247.com/ Paroxetine dark skin, GLYTcZY, <a href="httx://resolvedisputes.org/">Fioricet prescription online</a>, PmEMaMA, [url=httx://resolvedisputes.org/]Fioricet wcodiene for headache[/url], vPlKLhq, httx://resolvedisputes.org/ Online pharmacy fioricet, fxfhRcV.')

您可以像这样执行它(不使用光标)
SELECT allLinks.*
FROM #Data AS D
OUTER APPLY Temporary.getLinksFromText (D.weLink) AS allLinks

这个解决方案是可行的,但我必须创建一个游标来处理所有行,这不是最优的。有没有可能不使用游标而使用基于集合的方法呢?我想处理这个语句“SELECT Data FROM Paste WHERE paste.CaptchaOK = 0”。 - Daniel Björk
你的意思是 paste.CaptchaOK = 0 是什么? - Darka
我认为无论如何你都需要一个一个地去检查。请查看更新后的答案。 - Darka
"paste.CaptchaOK = 0" 这只是我的选择,不用在意 =) - Daniel Björk

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接