在Haskell中的编辑距离算法 - 性能调优

Question

在Haskell中的编辑距离算法 - 性能调优

performancehaskell

13

我正在尝试在Haskell中实现Levenshtein距离(或编辑距离)，但是当字符串长度增加时，性能会急剧下降。

我对Haskell还不太熟悉，如果您能给我一些建议，我会很感激。我已经尝试过"预计算"值(即inits)，但由于它没有改变任何东西，所以我撤销了那个更改。

我知道Hackage上已经有一个editDistance实现，但我需要它可以处理任意令牌的列表，而不仅仅是字符串。此外，与我的版本相比，我觉得它有点复杂。

这就是代码:

-- 两个列表之间的标准Levenshtein距离
editDistance      :: Eq a => [a] -> [a] -> Int
editDistance s1 s2 = editDistance' 1 1 1 s1 s2
-- 加权Levenshtein距离
-- ins, sub和del是各种操作的成本
editDistance'      :: Eq a => Int -> Int -> Int -> [a] -> [a] -> Int
editDistance' _ _ ins s1 [] = ins * length s1 
editDistance' _ _ ins [] s2 = ins * length s2 
editDistance' del sub ins s1 s2  
    | last s1 == last s2 = editDistance' del sub ins (init s1) (init s2)
    | otherwise          = minimum [ editDistance' del sub ins s1 (init s2)        + del -- 删除 
                                   , editDistance' del sub ins (init s1) (init s2) + sub -- 替换
                                   , editDistance' del sub ins (init s1) s2        + ins -- 插入
                                   ]

看起来这是一个正确的实现，至少它给出的结果与此online tool完全相同。

提前感谢您的帮助！如果您需要任何其他信息，请告诉我。

问候， bzn

- bzn

6个回答

5

你需要记忆化 editDistance'。有许多方法可以实现，例如，使用递归定义的数组。

- augustss

当我给你点赞时，为什么会出现一只独角兽和一个气球？这是在ICFP发表太多论文的后果吗？ - Thomas M. DuBuisson

@TomMD 这是一个 Stack Overflow 愚人节礼物。 - sclv

谢谢回答！我想我会使用TomMD提出的Memo-Monad。 - bzn

2

如前所述，您需要的是记忆化技术。此外，您正在从右到左查看编辑距离，这在处理字符串时并不高效，而且无论方向如何，编辑距离都是相同的。也就是说：editDistance (reverse a) (reverse b) == editDistance a b

要解决记忆化部分，有很多库可以帮助您。在下面的示例中，我选择了MemoTrie，因为它易于使用并且在这里表现良好。

import Data.MemoTrie(memo2)

editDistance' del sub ins = memf
  where
   memf = memo2 f
   f s1     []     = ins * length s1
   f []     s2     = ins * length s2
   f (x:xs) (y:ys)
     | x == y  = memf xs ys
     | otherwise = minimum [ del + memf xs (y:ys),
                             sub + memf (x:xs) ys,
                             ins + memf xs ys]

你可以看到，你所需要做的就是添加记忆化。其余部分与原来相同，只不过我们从列表的开头开始而不是结尾。

- HaskellElephant

但是为什么你在 f (x:xs) (y:ys) 的第一个等式中使用 f 而不是 memf？ - Rotsor

@Rotsor，这是一个错误（虽然它只会在运行时间中增加O(min(length xs,length ys))的因素）。将其更改为在所有递归情况下使用memf。 - HaskellElephant

1

这个版本比那些记忆化版本要快得多，但我仍然希望它能更快一些。适用于100个字符长的字符串。我编写时考虑了其他距离（更改init函数和cost），并使用了经典的动态规划数组技巧。长行可以转换为具有顶部“do”的单独函数，但我喜欢这种方式。

import Data.Array.IO
import System.IO.Unsafe

editDistance = dist ini med

dist :: (Int -> Int -> Int) -> (a -> a -> Int ) -> [a] -> [a] -> Int
dist i f a b  = unsafePerformIO $ distM i f a b

-- easy to create other distances 
ini i 0 = i
ini 0 j = j
ini _ _ = 0
med a b = if a == b then 0 else 2


distM :: (Int -> Int -> Int) -> (a -> a -> Int) -> [a] -> [a] -> IO Int
distM ini f a b = do
        let la = length a
        let lb = length b

        arr <- newListArray ((0,0),(la,lb)) [ini i j | i<- [0..la], j<-[0..lb]] :: IO (IOArray (Int,Int) Int)

-- all on one line
        mapM_ (\(i,j) -> readArray arr (i-1,j-1) >>= \ld -> readArray arr (i-1,j) >>= \l -> readArray arr (i,j-1) >>= \d-> writeArray arr (i,j) $ minimum [l+1,d+1, ld + (f (a !! (i-1) ) (b !! (j-1))) ] ) [(i,j)| i<-[1..la], j<-[1..lb]]

        readArray arr (la,lb)

- tolysz

1

所有内容都放在一行不是一个很好的布局...而且unsafePerformIO并不是必要的，绝对不是理想的选择——那段代码可以使用ST单子进行重写，并且只需要进行最小的更改。 - Oliver

1

人们建议您使用通用的记忆化库，但对于定义Levenshtein距离这样简单的任务，普通的动态规划已经足够了。一个非常简单的多态基于列表的实现：

distance s t = 
    d !!(length s)!!(length t) 
    where d = [ [ dist m n | n <- [0..length t] ] | m <- [0..length s] ]
          dist i 0 = i
          dist 0 j = j
          dist i j = minimum [ d!!(i-1)!!j+1
                             , d!!i!!(j-1)+1
                             , d!!(i-1)!!(j-1) + (if s!!(i-1)==t!!(j-1) 
                                                  then 0 else 1) 
                             ]

或者如果您需要在长序列上获得真正的速度，可以使用可变数组：

import Data.Array
import qualified Data.Array.Unboxed as UA
import Data.Array.ST
import Control.Monad.ST


-- Mutable unboxed and immutable boxed arrays
distance :: Eq a => [a] -> [a] -> Int
distance s t = d UA.! (ls , lt)
    where s' = array (0,ls) [ (i,x) | (i,x) <- zip [0..] s ]
          t' = array (0,lt) [ (i,x) | (i,x) <- zip [0..] t ]
          ls = length s
          lt = length t
          (l,h) = ((0,0),(length s,length t))
          d = runSTUArray $ do
                m <- newArray (l,h) 0 
                for_ [0..ls] $ \i -> writeArray m (i,0) i
                for_ [0..lt] $ \j -> writeArray m (0,j) j
                for_ [1..lt] $ \j -> do
                              for_ [1..ls] $ \i -> do
                                  let c = if s'!(i-1)==t'! (j-1) 
                                          then 0 else 1
                                  x <- readArray m (i-1,j)
                                  y <- readArray m (i,j-1)
                                  z <- readArray m (i-1,j-1)
                                  writeArray m (i,j) $ minimum [x+1, y+1, z+c ]
                return m

for_ xs f =  mapM_ f xs

- Grzegorz Chrupała

1

我知道 Hackage 上已经有一个 editDistance 的实现，但我需要它能够处理任意令牌的列表，而不仅仅是字符串。

这些令牌数量是有限的吗？我建议你尝试设计一个从令牌到字符的映射。毕竟，你可以使用10,646个字符。

- Dan Burton

谢谢，但现在我打算使用自己的解决方案，因为像TomMD建议的那样对其进行调整应该足够快了 - 这毕竟是我所需要的。 :P - bzn

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Thomas M. DuBuisson · Accepted Answer

虽然这是一个糟糕的算法（应该采用记忆化，我将在第二个部分中介绍），但忽略这一点...

使用O（1）原语而不是O（n）

一个问题是你使用了很多O（n）的列表调用（haskell列表是单向链表）。更好的数据结构会给你O（1）操作，我使用了vector：

import qualified Data.Vector as V

-- standard levenshtein distance between two lists
editDistance      :: Eq a => [a] -> [a] -> Int
editDistance s1 s2 = editDistance' 1 1 1 (V.fromList s1) (V.fromList s2)

-- weighted levenshtein distance
-- ins, sub and del are the costs for the various operations
editDistance'      :: Eq a => Int -> Int -> Int -> V.Vector a -> V.Vector a -> Int
editDistance' del sub ins s1 s2
  | V.null s2 = ins * V.length s1
  | V.null s1 = ins * V.length s2
  | V.last s1 == V.last s2 = editDistance' del sub ins (V.init s1) (V.init s2)
  | otherwise            = minimum [ editDistance' del sub ins s1 (V.init s2)        + del -- deletion 
                                   , editDistance' del sub ins (V.init s1) (V.init s2) + sub -- substitution
                                   , editDistance' del sub ins (V.init s1) s2        + ins -- insertion
                                   ]

对于列表而言，O(n) 操作包括 init、length 和 last（尽管init至少可以懒惰地执行）。但是在使用 Vector 时，所有这些操作都是 O(1)。

虽然真正的基准测试应该使用Criterion，但以下是一个快速而简单的基准测试：

str2 = replicate 15 'a' ++ replicate 25 'b'
str1 = replicate 20 'a' ++ replicate 20 'b'
main = print $ editDistance str1 str2

这段代码展示了向量版本所需的时间为0.09秒，而字符串版本需要1.6秒，因此我们省下了一个数量级，甚至还没有看你的editDistance算法。

现在考虑记忆化结果怎么办？

更大的问题显然是需要进行记忆化。我把这当作学习monad-memo包的机会 - 天哪，真是太棒了！只需要满足额外的限制（你需要Ord a），你就可以轻松地获得记忆化功能。代码如下：

import qualified Data.Vector as V
import Control.Monad.Memo

-- standard levenshtein distance between two lists
editDistance      :: (Eq a, Ord a) => [a] -> [a] -> Int
editDistance s1 s2 = startEvalMemo $ editDistance' (1, 1, 1, (V.fromList s1), (V.fromList s2))

-- weighted levenshtein distance
-- ins, sub and del are the costs for the various operations
editDistance' :: (MonadMemo (Int, Int, Int, V.Vector a, V.Vector a) Int m, Eq a) => (Int, Int, Int, V.Vector a, V.Vector a) -> m Int
editDistance' (del, sub, ins, s1, s2)
  | V.null s2 = return $ ins * V.length s1
  | V.null s1 = return $ ins * V.length s2
  | V.last s1 == V.last s2 = memo editDistance' (del, sub, ins, (V.init s1), (V.init s2))
  | otherwise = do
        r1 <- memo editDistance' (del, sub, ins, s1, (V.init s2))
        r2 <- memo editDistance' (del, sub, ins, (V.init s1), (V.init s2))
        r3 <- memo editDistance' (del, sub, ins, (V.init s1), s2)
        return $ minimum [ r1 + del -- deletion 
                         , r2 + sub -- substitution
                         , r3 + ins -- insertion
                                   ]

你看到了记忆化需要一个单一的"键"（参见MonadMemo类）吗？我将所有参数打包成一个又大又丑的元组。它还需要一个"value"，即你得到的Int。然后，只需使用"memo"函数插入和运行你想要记忆化的值即可。

为了进行基准测试，我使用了一个更短但距离较远的字符串：

$ time ./so  # the memoized vector version
12

real    0m0.003s

$ time ./so3  # the non-memoized vector version
12

real    1m33.122s

不要考虑运行非记忆化的字符串版本，我估计至少需要15分钟。至于我，现在我爱上了monad-memo - 感谢Eduard的这个包！

编辑：在记忆化版本中，String和Vector之间的差异并不大，但当距离达到200左右时，差异仍会增长到2倍左右，因此仍然值得使用。

编辑：也许我应该解释一下为什么“显然”更大的问题是记忆化结果。那么，如果您看一下原始算法的核心：

 [ editDistance' ... s1          (V.init s2)  + del 
 , editDistance' ... (V.init s1) (V.init s2) + sub
 , editDistance' ... (V.init s1) s2          + ins]

很明显调用editDistance' s1 s2会导致3个调用editDistance'...每个都会再次调用editDistance'三次...接着又是三次...指数级的爆炸！幸运的是大部分调用是相同的！例如（使用-->表示“调用”，eD表示editDistance'）：

eD s1 s2  --> eD s1 (init s2)             -- The parent
            , eD (init s1) s2
            , eD (init s1) (init s2)
eD (init s1) s2 --> eD (init s1) (init s2)         -- The first "child"
                  , eD (init (init s1)) s2
                  , eD (init (init s1)) (init s2) 
eD s1 (init s2) --> eD s1 (init (init s2))
                  , eD (init s1) (init s2)
                  , eD (init s1) (init (init s2))

仅考虑父节点和两个直接子节点，我们可以看到调用ed (init s1) (init s2)三次。另一个子节点也与父节点共享调用，并且所有子节点彼此之间（以及它们的子节点）共享许多调用，这启示了蒙提·派森（Monty Python）幽默短剧。

制作类似于runMemo的函数，返回使用了缓存结果的数量，可能是一项有趣、有益的练习。