Performance of Floyd-Warshall in Haskell - fixing a space leak


I want to implement the Floyd-Warshall all-pairs shortest paths algorithm in Haskell using Vector, in the hope of getting good performance.

The implementation is fairly straightforward, but instead of a 3-dimensional |V| × |V| × |V| matrix we use a 2-dimensional vector, since we only ever read the values for the previous k.

The algorithm is therefore really just a series of steps in which a 2D vector is passed in and a new 2D vector is produced. The final 2D vector contains the shortest paths between all pairs of nodes (i, j).

My intuition told me that it would be important to make sure the previous 2D vector is evaluated before each step, so I used BangPatterns on the prev argument of fw and a strict foldl':

{-# Language BangPatterns #-}

import           Control.DeepSeq
import           Control.Monad       (forM_)
import           Data.List           (foldl')
import qualified Data.Map.Strict     as M
import           Data.Vector         (Vector, (!), (//))
import qualified Data.Vector         as V
import qualified Data.Vector.Mutable as V hiding (length, replicate, take)

type Graph = Vector (M.Map Int Double)
type TwoDVector = Vector (Vector Double)

infinity :: Double
infinity = 1/0

-- calculate shortest path between all pairs in the given graph, if there are
-- negative cycles, return Nothing
allPairsShortestPaths :: Graph -> Int -> Maybe TwoDVector
allPairsShortestPaths g v =
  let initial = fw g v V.empty 0
      results = foldl' (fw g v) initial [1..v]
  in if negCycle results
        then Nothing
        else Just results
  where -- check for negative elements along the diagonal
        negCycle a = any not $ map (\i -> a ! i ! i >= 0) [0..(V.length a-1)]

-- one step of the Floyd-Warshall algorithm
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v !prev k = V.create $ do                                           -- ← bang
  curr <- V.new v
  forM_ [0..(v-1)] $ \i ->
    V.write curr i $ V.create $ do
      ivec <- V.new v
      forM_ [0..(v-1)] $ \j -> do
        let d = distance g prev i j k
        V.write ivec j d
      return ivec
  return curr

distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
distance g _ i j 0 -- base case; 0 if same vertex, edge weight if neighbours
  | i == j    = 0.0
  | otherwise = M.findWithDefault infinity j (g ! i)
distance _ a i j k = let c1 = a ! i ! j
                         c2 = (a ! i ! (k-1))+(a ! (k-1) ! j)
                     in min c1 c2

However, when running this program on a graph with 1000 nodes and 47978 edges, things do not look good at all. Memory usage is very high and the program takes far too long to run. The program was compiled with ghc -O2.

I rebuilt the program for profiling and limited the number of iterations to 50:

 results = foldl' (fw g v) initial [1..50]

I then ran the program with +RTS -p -hc and +RTS -p -hd:

This is interesting, but I guess it shows that the program is building up a huge number of thunks. That's not good.

OK, so after some experimentation I added a deepseq in fw to make sure prev really is evaluated:

let d = prev `deepseq` distance g prev i j k

Now things look better, and I can run the program to completion with constant memory usage. It is obvious that the bang on the prev argument was not enough.

For comparison with the earlier graph, here is the memory usage for 50 iterations after adding the deepseq.

OK, so things are better now, but I still have some questions:
  1. Is this the correct way to fix this space leak? I feel that inserting a deepseq is a bit ugly.
  2. Is my use of Vector here idiomatic/correct? I build a completely new vector for every iteration and hope that the garbage collector will delete the old Vectors.
  3. Is there anything else I could do to make this approach run faster?
For reference, here is graph.txt: http://sebsauvage.net/paste/?45147f7caf8c5f29#7tiCiPovPHWRm1XNvrSb/zNl3ujF3xB3yehrxhEdVWw=

And here is main:
main = do
  ls <- fmap lines $ readFile "graph.txt"
  let numVerts = head . map read . words . head $ ls
  let edges = map (map read . words) (tail ls)
  let g = V.create $ do
        g' <- V.new numVerts
        forM_ [0..(numVerts-1)] (\idx -> V.write g' idx M.empty)
        forM_ edges $ \[f,t,w] -> do
          -- subtract one from vertex IDs so we can index directly
          curr <- V.read g' (f-1)
          V.write g' (f-1) $ M.insert (t-1) (fromIntegral w) curr
        return g'
  let a = allPairsShortestPaths g numVerts
  case a of
    Nothing -> putStrLn "Negative cycle detected."
    Just a' -> do
      putStrLn  $ "The shortest, shortest path has length "
              ++ show ((V.minimum . V.map V.minimum) a')

Have you tried rewriting your foldl' and forM_ calculations as explicit loops over mutable vectors? (e.g. as is done in test0 here, although with arrays rather than vectors; and here, with a loop instead of the usual forM.) - Will Ness
@WillNess: No, the only thing I tried was replacing the foldl' with a tail-recursive function with a strict accumulator, but that didn't seem to make any difference. It is a bit discouraging that both of the examples you link to are full of unsafe* functions - I was really hoping for reasonable performance without resorting to those. :-) - beta
You should be using unboxed vectors. That will force the contents to be evaluated simply by being inserted into the vector. The unsafe operations in those examples are only there to remove bounds checking. - Carl
Your TwoDVector is just a matrix, right? Have you considered using Repa? Simon Marlow implements the FW algorithm as an example in a few different contexts, for instance here: http://chimera.labs.oreilly.com/books/1230000000929/ch05.html#sec_par-repa-array-ops - Chad Scherrer
@Carl: Yes, I think you're right. But I can't create an unboxed vector of unboxed vectors, so I would have to store the data as a single vector, or switch to something other than Vector. - beta
1 Answer


First, some general code cleanup:

In your fw function, you explicitly allocate and fill a mutable vector. However, there is a premade function for exactly this purpose, namely generate. fw can therefore be rewritten as

V.generate v (\i -> V.generate v (\j -> distance g prev i j k))
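Dropped into the question's code, the whole function could then read as follows (just a sketch; the Graph and TwoDVector types and the distance function stay exactly as in the question):

-- fw with the mutable-vector plumbing replaced by generate;
-- types and distance are as in the question.
fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v prev k = V.generate v (\i -> V.generate v (\j -> distance g prev i j k))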

Similarly, the graph-building code can be replaced with replicate and accum:

let parsedEdges = map (\[f,t,w] -> (f - 1, (t - 1, fromIntegral w))) edges
let g = V.accum (flip (uncurry M.insert)) (V.replicate numVerts M.empty) parsedEdges
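To make the combining function easier to read: accum folds every (index, value) pair into the initial vector with the supplied function, so each parsed edge (f-1, (t-1, w)) inserts an entry into the adjacency map at index f-1. A tiny, self-contained illustration with made-up data (not taken from the question's graph):

import qualified Data.Map.Strict as M
import qualified Data.Vector     as V

-- Two vertices; edge 0 -> 1 with weight 2.5 and edge 1 -> 0 with weight 1.0.
demo :: V.Vector (M.Map Int Double)
demo = V.accum (flip (uncurry M.insert)) (V.replicate 2 M.empty)
               [(0, (1, 2.5)), (1, (0, 1.0))]
-- demo == V.fromList [M.fromList [(1, 2.5)], M.fromList [(0, 1.0)]]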

Note that this gets rid of the need for mutation entirely, without losing any performance.

Now, for the actual questions:
  1. In my experience, deepseq is very useful, but only as a quick fix for space leaks like this one. The fundamental problem is not that you need to force the results after you've produced them. Instead, the use of deepseq implies that you should have been building the structure more strictly in the first place. In fact, if you add a bang pattern in your vector creation code like so:

    let !d = distance g prev i j k
    

    Then the problem is fixed without deepseq. Note that this doesn't work with the generate code, because, for some reason (I might create a feature request for this), vector does not provide strict functions for boxed vectors. However, when I get to unboxed vectors in the answer to question 3, which are strict, both approaches work without strictness annotations.

  2. As far as I know, the pattern of repeatedly generating new vectors is idiomatic. The only thing not idiomatic is the use of mutability - except when they are strictly necessary, mutable vectors are generally discouraged.

  3. There are a couple of things to do:

    • Most simply, you can replace Map Int with IntMap. As that isn't really the slow point of the function, this doesn't matter too much, but IntMap can be much faster for heavy workloads.

    • You can switch to using unboxed vectors. Although the outer vector has to remain boxed, as vectors of vectors can't be unboxed, the inner vectors can be. This also solves your strictness problem - because unboxed vectors are strict in their elements, you don't get a space leak. Note that on my machine, this improves the performance from 4.1 seconds to 1.3 seconds, so the unboxing is very helpful. (A rough sketch of this change is included after this list.)

    • You can flatten the vector into a single one and use multiplication and division to switch between two-dimensional indices and one-dimensional indices. I don't recommend this, as it is a bit involved, quite ugly, and, due to the division, actually slows down the code on my machine.

    • You can use repa. This has the huge advantage of automatically parallelizing your code. Note that, since repa flattens its arrays and apparently doesn't properly get rid of the divisions needed to fill nicely (it's possible to do with nested loops, but I think it uses a single loop and a division), it has the same performance penalty as I mentioned above, bringing the runtime from 1.3 seconds to 1.8. However, if you enable parallelism and use a multicore machine, you start seeing some benefits. Unfortunately, your current test case is too tiny to see much benefit, so, on my 6 core machine, I see it drop back down to 1.2 seconds. If I up the size back to [1..v] instead of [1..50], the parallelism brings it from 32 seconds to 13. Presumably, if you give this program a larger input, you might see more benefit.

      If you're interested, I've posted my repa-ified version here.

    • EDIT: Use -fllvm. Testing on my computer, using repa, I get 14.7 seconds without parallelism, which is almost as good as without -fllvm and with parallelism. In general, LLVM just handles array-based code like this very well.
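As promised above, here is a rough sketch of the unboxed-inner-vector change (mine, not the answerer's posted version), assuming the rest of the program is adjusted to index the outer vector with V.! and the inner rows with U.!; the diagonal check in negCycle and the final minimum in main need the same treatment:

import qualified Data.Map.Strict     as M
import qualified Data.Vector         as V
import qualified Data.Vector.Unboxed as U

type Graph      = V.Vector (M.Map Int Double)
-- The outer vector stays boxed; the inner rows are unboxed and
-- therefore strict in their Double elements.
type TwoDVector = V.Vector (U.Vector Double)

infinity :: Double
infinity = 1/0

fw :: Graph -> Int -> TwoDVector -> Int -> TwoDVector
fw g v prev k = V.generate v (\i -> U.generate v (\j -> distance g prev i j k))

distance :: Graph -> TwoDVector -> Int -> Int -> Int -> Double
distance g _ i j 0                    -- base case, as in the question
  | i == j    = 0.0
  | otherwise = M.findWithDefault infinity j (g V.! i)
distance _ a i j k = min (a V.! i U.! j)
                         (a V.! i U.! (k-1) + a V.! (k-1) U.! j)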


Thanks a lot! I'll dig into all of this over the next few days - there's a lot of great information here. :) - beta
