T-SQL中的Levenshtein距离

Question

T-SQL中的Levenshtein距离

t-sqledit-distancelevenshtein-distance

108

我对在T-SQL中计算Levenshtein距离的算法很感兴趣。

- Alexander Prokofyev

7个回答

63

Arnold Fribble在sqlteam.com/forums上有两个提议：

一个来自2005年6月，
另一个是更新的，来自2006年5月。

这是来自2006年的更新版本：

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO

CREATE FUNCTION edit_distance_within(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
    WHILE @j <= @tl
    BEGIN
      SET @c = @c + 1
      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
      IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c
      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
    END
    IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO

- Alexander Prokofyev

1

@Alexander，看起来它可以工作，但我建议你把变量名改成更有意义的名称。此外，我会删掉 @d，在输入中你已经知道了两个字符串的长度。 - Lieven Keersmaekers

3

@Lieven：这不是我的实现，作者是Arnold Fribble。@d参数是字符串之间允许的最大差异，达到此差异后，它们被认为是太不相似了，函数就会返回-1。这是因为在T-SQL中，算法运行速度太慢了。 - Alexander Prokofyev

1

你应该查看一下算法伪代码，它位于这里：http://en.wikipedia.org/wiki/Levenshtein_distance 它没有太多改进。 - Norman H

14

据我所知，在SQL Server 2005以及之后的版本中，您可以使用任何.NET语言编写存储过程：在 SQL Server 2005 中使用 CLR 集成。有了这个，编写一个用于计算Levenstein距离的存储过程就不难了。

从帮助中提取出来的一个简单的“Hello, World!”：

using System;
using System.Data;
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public class HelloWorldProc
{
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void HelloWorld(out string text)
    {
        SqlContext.Pipe.Send("Hello world!" + Environment.NewLine);
        text = "Hello world!";
    }
}

在你的 SQL Server 中运行以下代码：

CREATE ASSEMBLY helloworld from 'c:\helloworld.dll' WITH PERMISSION_SET = SAFE

CREATE PROCEDURE hello
@i nchar(25) OUTPUT
AS
EXTERNAL NAME helloworld.HelloWorldProc.HelloWorld

现在你可以进行测试运行：

DECLARE @J nchar(25)
EXEC hello @J out
PRINT @J

希望这可以帮到你。

- Leandro López

7

你可以使用Levenshtein距离算法来比较字符串。

这里提供了一个T-SQL示例，网址是：http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx。

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

（由Joseph Gama开发的函数）

用法：

select
 dbo.edit_distance('Fuzzy String Match','fuzzy string match'),
 dbo.edit_distance('fuzzy','fuzy'),
 dbo.edit_distance('Fuzzy String Match','fuzy string match'),
 dbo.edit_distance('levenshtein distance sql','levenshtein sql server'),
 dbo.edit_distance('distance','server')

该算法只需在一步内通过替换不同字符来返回将一个字符串更改为另一个字符串的步骤计数。

- Eralper

很遗憾，这并不包括字符串为空的情况。 - Codeman

2

在TSQL中，比较两个项的最佳和最快的方法是使用在索引列上连接表的SELECT语句。因此，如果您想从RDBMS引擎的优势中受益，则建议使用这种方法来实现编辑距离。TSQL循环也可以工作，但对于大量比较，Levenstein距离计算在其他语言中比在TSQL中更快。

我已经在几个系统中使用一系列针对仅用于此目的的临时表来实现编辑距离。它需要一些重型预处理步骤-准备临时表-但是它在大量比较方面非常有效。

简而言之：预处理包括创建、填充和索引临时表。第一个表包含参考ID、一个字母列和一个charindex列。通过运行一系列插入查询将每个单词分成字母（使用SELECT SUBSTRING）以创建与源列表中的单词具有字母相同数量的行，该表由此填充（我知道，那是很多行，但SQL服务器可以处理数十亿行）。然后制作一个带有2个字母列的第二个表，另一个带有3个字母列的表等。最终结果是一系列包含参考ID和每个单词的子字符串以及它们在单词中位置的引用的表。

一旦完成了这个步骤，整个游戏就是复制这些表并将它们与其副本在GROUP BY选择查询中连接，该查询计算匹配数量。这为每对单词创建了一系列度量标准，然后重新聚合为每对单词的单个Levenstein距离。

从技术上讲，这与Levenstein距离（或其变体）的大多数其他实现非常不同，因此您需要深入了解Levenstein距离的工作原理以及为什么设计为这样。还要调查其他替代方案，因为使用该方法，您最终会得到一系列底层指标，这些指标可以帮助同时计算许多编辑距离的变体，从而为您提供有趣的机器学习潜力改进。

另一个已经在此页面中提到的观点：尽可能预处理以消除不需要距离测量的成对项。例如，没有任何共同字母的两个单词应该被排除，因为可以从字符串的长度中获得编辑距离。或者不要测量相同单词的两个副本之间的距离，因为它的值自然为0。如果您的单词列表来自长文本，则在进行测量之前删除重复项很可能会出现相同的单词，因此仅测量距离一次将节省处理时间等。

- JeromeE

2

我也在寻找Levenshtein算法的代码示例，很高兴在这里找到了。当然，我想要理解算法是如何工作的，并且我尝试着运用Veve发布的一个以上示例进行实验。为了更好地理解代码，我创建了一个带有矩阵的EXCEL表格。 FUZZY与FUZY的距离图片胜过千言万语。

通过这个EXCEL表格，我发现还有潜在的性能优化空间。上右侧红色区域中的所有值都不需要计算。每个红色单元格的值都等于左边单元格的值加1。这是因为第二个字符串在该区域总是比第一个字符串长，每个字符增加1的距离值。

您可以通过使用语句IF @j <= @i并在此语句之前增加@i的值来反映这一点。

CREATE FUNCTION [dbo].[f_LevenshteinDistance](@s1 nvarchar(3999), @s2 nvarchar(3999))
    RETURNS int
    AS
    BEGIN
       DECLARE @s1_len  int;
       DECLARE @s2_len  int;
       DECLARE @i       int;
       DECLARE @j       int;
       DECLARE @s1_char nchar;
       DECLARE @c       int;
       DECLARE @c_temp  int;
       DECLARE @cv0     varbinary(8000);
       DECLARE @cv1     varbinary(8000);

       SELECT
          @s1_len = LEN(@s1),
          @s2_len = LEN(@s2),
          @cv1    = 0x0000  ,
          @j      = 1       , 
          @i      = 1       , 
          @c      = 0

       WHILE @j <= @s2_len
          SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1;

          WHILE @i <= @s1_len
             BEGIN
                SELECT
                   @s1_char = SUBSTRING(@s1, @i, 1),
                   @c       = @i                   ,
                   @cv0     = CAST(@i AS binary(2)),
                   @j       = 1;

                SET @i = @i + 1;

                WHILE @j <= @s2_len
                   BEGIN
                      SET @c = @c + 1;

                      IF @j <= @i 
                         BEGIN
                            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END;
                            IF @c > @c_temp SET @c = @c_temp
                            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1;
                            IF @c > @c_temp SET @c = @c_temp;
                         END;
                      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1;
                   END;
                SET @cv1 = @cv0;
          END;
       RETURN @c;
    END;

- Marcus Belz

如此编写的代码并不总是能够给出正确的结果。例如，输入 ('jane', 'jeanne') 将返回距离为 3，而实际上距离应该是 2。为了纠正这个问题，需要添加额外的代码，如果 @s1 的长度比 @s2 短，则交换 @s1 和 @s2。 - hatchet - done with SOverflow

0

我为Azure Synapse编写的修改（改用SET而不是SELECT）：

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [db0].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SET @v0 = @s
        SET @i = @sLen -- temporarily use v0 for swap
        SET @s = @t
        SET @sLen = @tLen
        SET @t = @v0
        SET @tLen = @i
    END
    SET @max = ISNULL(@max, @tLen)
    SET @lenDiff = @tLen - @sLen

    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SET @sLen = @sLen - 1
        SET @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SET @start = @start + 1
    IF (@start > 1) BEGIN
        SET @sLen = @sLen - (@start - 1)
        SET @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SET @s = SUBSTRING(@s, @start, @sLen)
        SET @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SET @v0 = ''
    SET @j = 1

    WHILE (@j <= @tLen) BEGIN
        SET @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SET @j = @j + 1
    END

    SET @jOffset = @max - @lenDiff
    SET @i = 1

    WHILE (@i <= @sLen) BEGIN
        SET @distance = @i
        SET @diag = @i - 1
        SET @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
        SET @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
        SET @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END

        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SET @left = UNICODE(SUBSTRING(@v0, @j, 1))
            SET @thisJ = @j

            SET @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                   END
                     END
            SET @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
            SET @diag = @left
            SET @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) 
                then @jEnd + 2 
                else @thisJ + 1 end
        END
        SET @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

- user1390375

你至少可以给你从中获取代码的人以信用...... - Alex

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- hatchet - done with SOverflow · Accepted Answer

我使用了标准的Levenshtein编辑距离函数，并进行了几处优化以提高速度，比我知道的其他版本都要快。在两个字符串在其开头（共享前缀）和结尾（共享后缀）有相同字符，并且字符串很大并且提供了最大编辑距离的情况下，速度的提升是显著的。例如，当输入为两个非常相似的4000个字符的字符串，并指定最大编辑距离为2时，这比接受答案中的edit_distance_within函数快近三个数量级，返回答案需要0.073秒（73毫秒）而不是55秒。它还具有内存效率，使用空间等于两个输入字符串中较大的一个加上一些常量空间。它使用一个表示列的单个nvarchar“数组”，并在其中进行所有计算，再加上一些辅助int变量。

优化：

跳过处理共享前缀和/或后缀
如果较大的字符串以整个较小的字符串开始或结束，则提前返回
如果大小差异保证超出了最大距离，则提前返回
仅使用表示矩阵中一列的单个数组（实现为nvarchar）
当给定最大距离时，时间复杂度从（len1 * len2）变为（min（len1，len2）），即线性
当给定最大距离时，一旦知道无法达到最大距离限制，就提前返回

这是代码（于1/20/2014更新以进一步提高速度）：

-- =============================================
-- Computes and returns the Levenshtein edit distance between two strings, i.e. the
-- number of insertion, deletion, and sustitution edits required to transform one
-- string to the other, or NULL if @max is exceeded. Comparisons use the case-
-- sensitivity configured in SQL Server (case-insensitive by default).
-- 
-- Based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
-- at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm,
-- with some additional optimizations.
-- =============================================
CREATE FUNCTION [dbo].[Levenshtein](
    @s nvarchar(4000)
  , @t nvarchar(4000)
  , @max int
)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @distance int = 0 -- return variable
          , @v0 nvarchar(4000)-- running scratchpad for storing computed distances
          , @start int = 1      -- index (1 based) of first non-matching character between the two string
          , @i int, @j int      -- loop counters: i for s string and j for t string
          , @diag int          -- distance in cell diagonally above and left if we were using an m by n matrix
          , @left int          -- distance in cell to the left if we were using an m by n matrix
          , @sChar nchar      -- character at index i from s string
          , @thisJ int          -- temporary storage of @j to allow SELECT combining
          , @jOffset int      -- offset used to calculate starting value for j loop
          , @jEnd int          -- ending value for j loop (stopping point for processing a column)
          -- get input string lengths including any trailing spaces (which SQL Server would otherwise ignore)
          , @sLen int = datalength(@s) / datalength(left(left(@s, 1) + '.', 1))    -- length of smaller string
          , @tLen int = datalength(@t) / datalength(left(left(@t, 1) + '.', 1))    -- length of larger string
          , @lenDiff int      -- difference in length between the two strings
    -- if strings of different lengths, ensure shorter string is in s. This can result in a little
    -- faster speed by spending more time spinning just the inner loop during the main processing.
    IF (@sLen > @tLen) BEGIN
        SELECT @v0 = @s, @i = @sLen -- temporarily use v0 for swap
        SELECT @s = @t, @sLen = @tLen
        SELECT @t = @v0, @tLen = @i
    END
    SELECT @max = ISNULL(@max, @tLen)
         , @lenDiff = @tLen - @sLen
    IF @lenDiff > @max RETURN NULL

    -- suffix common to both strings can be ignored
    WHILE(@sLen > 0 AND SUBSTRING(@s, @sLen, 1) = SUBSTRING(@t, @tLen, 1))
        SELECT @sLen = @sLen - 1, @tLen = @tLen - 1

    IF (@sLen = 0) RETURN @tLen

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1)) 
        SELECT @start = @start + 1
    IF (@start > 1) BEGIN
        SELECT @sLen = @sLen - (@start - 1)
             , @tLen = @tLen - (@start - 1)

        -- if all of shorter string matches prefix and/or suffix of longer string, then
        -- edit distance is just the delete of additional characters present in longer string
        IF (@sLen <= 0) RETURN @tLen

        SELECT @s = SUBSTRING(@s, @start, @sLen)
             , @t = SUBSTRING(@t, @start, @tLen)
    END

    -- initialize v0 array of distances
    SELECT @v0 = '', @j = 1
    WHILE (@j <= @tLen) BEGIN
        SELECT @v0 = @v0 + NCHAR(CASE WHEN @j > @max THEN @max ELSE @j END)
        SELECT @j = @j + 1
    END

    SELECT @jOffset = @max - @lenDiff
         , @i = 1
    WHILE (@i <= @sLen) BEGIN
        SELECT @distance = @i
             , @diag = @i - 1
             , @sChar = SUBSTRING(@s, @i, 1)
             -- no need to look beyond window of upper left diagonal (@i) + @max cells
             -- and the lower right diagonal (@i - @lenDiff) - @max cells
             , @j = CASE WHEN @i <= @jOffset THEN 1 ELSE @i - @jOffset END
             , @jEnd = CASE WHEN @i + @max >= @tLen THEN @tLen ELSE @i + @max END
        WHILE (@j <= @jEnd) BEGIN
            -- at this point, @distance holds the previous value (the cell above if we were using an m by n matrix)
            SELECT @left = UNICODE(SUBSTRING(@v0, @j, 1))
                 , @thisJ = @j
            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1)) THEN @diag                    --match, no change
                     ELSE 1 + CASE WHEN @diag < @left AND @diag < @distance THEN @diag    --substitution
                                   WHEN @left < @distance THEN @left                    -- insertion
                                   ELSE @distance                                        -- deletion
                                END    END
            SELECT @v0 = STUFF(@v0, @thisJ, 1, NCHAR(@distance))
                 , @diag = @left
                 , @j = case when (@distance > @max) AND (@thisJ = @i + @lenDiff) then @jEnd + 2 else @thisJ + 1 end
        END
        SELECT @i = CASE WHEN @j > @jEnd + 1 THEN @sLen + 1 ELSE @i + 1 END
    END
    RETURN CASE WHEN @distance <= @max THEN @distance ELSE NULL END
END

如此函数的评论所述，字符比较的大小写敏感性将遵循当前生效的排序规则。默认情况下，SQL Server的排序规则会导致不区分大小写的比较。

要始终进行大小写敏感的比较，一种方法是在比较两个字符串的两个位置添加特定的排序规则。但是，我尚未对此进行彻底测试，特别是当数据库使用非默认排序规则时可能会产生副作用。

以下是如何更改这两行以强制进行大小写敏感的比较：

    -- prefix common to both strings can be ignored
    WHILE (@start < @sLen AND SUBSTRING(@s, @start, 1) = SUBSTRING(@t, @start, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS)

并且

            SELECT @distance = 
                CASE WHEN (@sChar = SUBSTRING(@t, @j, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) THEN @diag                    --match, no change