Creating a large set - need to reduce time spent in garbage collection

Question

optimization haskell garbage-collection set ghc

3

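This program builds a very large set to look for collisions in a hash function. Is there a way to reduce the time spent in garbage collection? Running with +RTS -s reports more than 40% of the time being spent in GC.

Example usage: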

./program 0 1000000 +RTS -s
./program 145168473 10200000 +RTS -s

Is there a better algorithm or data structure to use?

{-# LANGUAGE OverloadedStrings #-}

import System.Environment
import Control.Monad
import Crypto.Hash.SHA256

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Char
import Data.Int
import Data.Bits
import Data.Binary
import Data.Set as Set
import Data.List
import Numeric

str2int :: (Integral a) => B.ByteString -> a
str2int bs = B.foldl (\a w -> (a * 256)+(fromIntegral $ ord w)) 0 bs

t50 :: Int64 -> Int64
t50 i = let h = hash $ B.concat $ BL.toChunks $ encode i
        in
          (str2int $ B.drop 25 h) .&. 0x3ffffffffffff

sha256 :: Int64 -> B.ByteString
sha256 i = hash $ B.concat $ BL.toChunks $ encode i

-- firstCollision :: Ord b => (a -> b) -> [a] -> Maybe a
firstCollision f xs = go f Set.empty xs
  where
    -- go :: Ord b => (a -> b) -> Set b -> [a] -> Maybe a
    go _ _ []     = Nothing
    go f s (x:xs) = let y = f x
                    in
                      if y `Set.member` s
                        then Just x
                        else go f (Set.insert y s) xs

showHex2 i
  | i < 16    = "0" ++ (showHex i "")
  | otherwise = showHex i ""

prettyPrint :: B.ByteString -> String
prettyPrint = concat . (Data.List.map showHex2) . (Data.List.map ord) . B.unpack


showhash inp =
  let  h = sha256 inp
       x = B.concat $ BL.toChunks $ encode inp
   in do putStrLn $ "  - input: " ++ (prettyPrint x) ++ " -- " ++ (show inp)
         putStrLn $ "  -  hash: " ++ (prettyPrint h)

main = do
         args <- getArgs
         let a = (read $ args !! 0) :: Int64
             b = (read $ args !! 1) :: Int64
             c = firstCollision t [a..(a+b)]
             t = t50
         case c of
           Nothing -> putStrLn "No collision found"
           Just x  -> do let h = t x
                         putStrLn $ "Found collision at " ++ (show x)
                         showhash x
                         let first = find (\x -> (t x) == h) [a..(a+b)]
                          in case first of
                               Nothing -> putStrLn "oops -- failed to find hash"
                               Just x0 -> do putStrLn $ "first instance at " ++ (show x0)
                                             showhash x0

- ErikR

3

Use heap profiling to find out what is going on. - augustss

1

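Have you tried increasing the allocation area to 100 MB, for example with +RTS -A100M? This can sometimes improve performance by making GC runs less frequent. For a program like this one, where most of the generated data has to be retained for a while, it can make a big difference. - John L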

In the worker of firstCollision, s should be strict. - luqui
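
luqui's suggestion can be sketched with a bang pattern on the accumulator. This is a minimal illustration, not a measured fix: the name firstCollisionStrict is made up here, and because Set.member already inspects the set on every step, the explicit strictness may change little in practice.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.Set as Set

-- Hypothetical strict variant of the question's firstCollision.
-- Returns the first input whose image under f has been seen before.
firstCollisionStrict :: Ord b => (a -> b) -> [a] -> Maybe a
firstCollisionStrict f = go Set.empty
  where
    -- The bang forces the set to weak head normal form on every call,
    -- so a chain of pending Set.insert thunks cannot build up.
    go !_ [] = Nothing
    go !s (x:xs)
      | y `Set.member` s = Just x
      | otherwise        = go (Set.insert y s) xs
      where
        y = f x
```

The behaviour is the same as the original worker; only the evaluation order of the accumulator is made explicit.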

3 Answers

2

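You are constructing ByteStrings very frequently with the binary package (you could use cereal if you want to avoid converting to and from lazy chunks). If you dig into the internals of the Builder monoid, you will find that its default initial buffer size is about 32k. Given that you only need 8 bytes, this may be putting more pressure on the garbage collector than necessary.

Since you are really only using binary for the encoding, you could write something like the following yourself: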

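-- Standalone snippet: B is the Word8-based Data.ByteString here.
-- (The question imports Data.ByteString.Char8 as B, whose unfoldrN works on Char.)
import qualified Data.ByteString as B
import Data.Bits (shiftR)
import Data.Int (Int64)
import Data.Word (Word8)

-- Encode an Int64 as 8 bytes, most significant byte first.
encodeInt64 :: Int64 -> B.ByteString
encodeInt64 x =
  let
    -- i is the current shift amount: 56, 48, ..., 0
    go :: Int -> Maybe (Word8, Int)
    go i
      | i < 0     = Nothing
      | otherwise =
        let
          w :: Word8
          w = fromIntegral (x `shiftR` i)
        in Just (w, i-8)
  in fst $ B.unfoldrN 8 go 56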

I dare say you could do even better by poking the bytes directly into a buffer.
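
As a rough sketch of what "poking the bytes directly into a buffer" could look like: the name encodeInt64Poke is hypothetical, and unsafeCreate comes from the internal module Data.ByteString.Internal, so treat this as an illustration under those assumptions, not as the answerer's code.

```haskell
import qualified Data.ByteString as BS
import Data.ByteString.Internal (unsafeCreate)
import Data.Bits (shiftR)
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)

-- Hypothetical encoder: allocate exactly 8 bytes and write them in place,
-- most significant byte first.
encodeInt64Poke :: Int64 -> BS.ByteString
encodeInt64Poke x = unsafeCreate 8 $ \p ->
  let byteAt :: Int -> Word8
      byteAt k = fromIntegral (x `shiftR` (56 - 8 * k))
  in mapM_ (\k -> pokeByteOff p k (byteAt k)) [0 .. 7]
```

Whether this beats the unfoldrN version is something to measure with +RTS -s, not to assume.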

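That is one thing. Another point, unrelated to GC, is that you are using the standard Data.Set implementation; you may get slightly better performance from Data.HashSet in unordered-containers.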
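
A minimal sketch of the Data.HashSet swap, assuming the unordered-containers and hashable packages are available. The name firstCollisionHS is made up for illustration; the logic mirrors the question's firstCollision.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.HashSet as HS
import Data.Hashable (Hashable)

-- Hypothetical HashSet-based variant: the Ord constraint becomes
-- Eq + Hashable, which Int64 satisfies.
firstCollisionHS :: (Eq b, Hashable b) => (a -> b) -> [a] -> Maybe a
firstCollisionHS f = go HS.empty
  where
    go !_ [] = Nothing
    go !s (x:xs)
      | y `HS.member` s = Just x
      | otherwise       = go (HS.insert y s) xs
      where
        y = f x
```

In main this would replace the call to firstCollision, for example firstCollisionHS t50 [a..(a+b)].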

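Finally, as Don also mentioned, you can request a larger allocation area with -A200M (or similar).

With all of the above changes (your own encoder, Data.HashSet, and -A200M), the run time on my machine drops from 7.397 s to 3.474 s, with %GC time going from 52.9% to 21.2%.

So there is nothing wrong with your approach in the Big-O sense, but there are some constants that can be brought down a bit!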

- ScottWest

1

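I'm not sure. But here is some profiler output, in case someone can build a real answer from it:

Here is the heap profile (from running with +RTS -hT):

[Figure: heap profile]

I think you are accumulating thunks in firstCollision because of the non-strict evaluation of Set.insert. In absolute terms, though, the memory allocation is so small that I'm not sure it is the real culprit; see below.

Here is the output from the profiler (compiled with -prof -fprof-auto, run with +RTS -p):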

COST CENTRE         MODULE  %time %alloc

firstCollision.go   Main     49.4    2.2
t50.h               Main     39.5   97.5
str2int             Main      5.4    0.0
firstCollision.go.y Main      3.4    0.0
t50                 Main      1.1    0.0

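Basically all of the memory allocation comes from h, the local equivalent of the sha256 serialization/hashing pipeline, where a lot of intermediate data structures seem to be built.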
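
One way to cut intermediate structures out of that pipeline is to encode strictly and fold the digest bytes directly. The sketch below is an assumption-laden illustration: encodeBE and t50With are hypothetical names, and the digest function is passed in as a parameter so the example does not depend on a particular hashing package.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Int (Int64)
import Data.Word (Word8)

-- Strict big-endian 8-byte encoding; no lazy chunks are involved.
encodeBE :: Int64 -> BS.ByteString
encodeBE x = fst (BS.unfoldrN 8 step 56)
  where
    step :: Int -> Maybe (Word8, Int)
    step i
      | i < 0     = Nothing
      | otherwise = Just (fromIntegral (x `shiftR` i), i - 8)

-- Same result as the question's t50 for a 32-byte digest: keep the low
-- 50 bits of the last 7 digest bytes. The digest argument is an assumption
-- standing in for the question's SHA-256 hash function.
t50With :: (BS.ByteString -> BS.ByteString) -> Int64 -> Int64
t50With digest i =
  BS.foldl' step 0 (BS.drop 25 (digest (encodeBE i))) .&. 0x3ffffffffffff
  where
    step :: Int64 -> Word8 -> Int64
    step acc w = (acc `shiftL` 8) .|. fromIntegral w
```

With the question's imports this would be used as t50With hash, where hash is the strict ByteString function from Crypto.Hash.SHA256.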

Can someone with more experience pin down the problem more precisely?

- jtobin

2

If there were unevaluated thunks, they should show up as type "THUNK". This memory profile looks pretty good for this algorithm. - John L

- Don Stewart · Accepted Answer

As you noticed, the GC statistics report low productivity:

  44,184,375,988 bytes allocated in the heap
   1,244,120,552 bytes copied during GC
      39,315,612 bytes maximum residency (42 sample(s))
         545,688 bytes maximum slop
             109 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     81400 colls,     0 par    2.47s    2.40s     0.0000s    0.0003s
  Gen  1        42 colls,     0 par    1.06s    1.08s     0.0258s    0.1203s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    4.58s  (  4.63s elapsed)
  GC      time    3.53s  (  3.48s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    8.11s  (  8.11s elapsed)

  %GC     time      43.5%  (42.9% elapsed)

  Alloc rate    9,651,194,755 bytes per MUT second

  Productivity  56.5% of total user, 56.4% of total elapsed

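The most obvious first step is to increase the GC's default areas, to try to avoid the need for resizing. One trick, for example, is to enlarge the -A allocation area (tools such as GC tune can help find the right settings for your program).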

  $ ./A ... +RTS -s -A200M

  Total   time    7.89s  (  7.87s elapsed)

  %GC     time      26.1%  (26.5% elapsed)

  Alloc rate    7,581,233,460 bytes per MUT second

  Productivity  73.9% of total user, 74.1% of total elapsed

So we shaved off about a quarter of a second, but productivity rose to roughly 74%. Now we should look at the heap profile:

[Figure: heap profile]

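This shows straightforward linear growth of the set and its Int values. That is what your algorithm specifies, though, so I don't think there is much you can do as long as you retain all the results.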
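
Since the retained values are 50-bit integers, one option this answer does not mention is Data.IntSet from containers, which stores Int keys without a separately boxed element per entry. The sketch below assumes a 64-bit GHC (so that Int can hold 50 bits) and uses the made-up name firstCollisionIS; whether it lowers residency or GC time for this workload has not been measured here.

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.IntSet as IS
import Data.Int (Int64)

-- Hypothetical IntSet-based variant, specialised to Int64 hash values.
firstCollisionIS :: (a -> Int64) -> [a] -> Maybe a
firstCollisionIS f = go IS.empty
  where
    go !_ [] = Nothing
    go !s (x:xs)
      | y `IS.member` s = Just x
      | otherwise       = go (IS.insert y s) xs
      where
        -- Assumes Int is at least 50 bits wide (true on 64-bit GHC).
        y = fromIntegral (f x) :: Int
```

The set still grows linearly with the number of inputs, as the heap profile shows; only the per-element cost might change.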