如何重构这个Haskell随机字节输出器?

7

我试图在Haskell内快速生成随机数据,但是当我尝试使用任何惯用的方法时,速度变慢且GC开销大。

以下是简短的代码:

import qualified System.Random.Mersenne as RM
import qualified Data.ByteString.Lazy as BL
import qualified System.IO as SI
import Data.Word

main = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    rnd <- RM.randoms  r :: IO [Word8]
    BL.hPutStr SI.stdout $ BL.pack rnd

以下是快速的代码:

import qualified System.Random.Mersenne as RM
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Binary.Put as DBP
import qualified System.IO as SI
import Data.List
import Control.Monad (void, forever)
import Data.Word

main = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    forever $ do
        x0 <- RM.random r :: IO Word32
        x1 <- RM.random r :: IO Word32
        x2 <- RM.random r :: IO Word32
        x3 <- RM.random r :: IO Word32
        x4 <- RM.random r :: IO Word32
        x5 <- RM.random r :: IO Word32
        x6 <- RM.random r :: IO Word32
        x7 <- RM.random r :: IO Word32
        x8 <- RM.random r :: IO Word32
        x9 <- RM.random r :: IO Word32
        xA <- RM.random r :: IO Word32
        xB <- RM.random r :: IO Word32
        xC <- RM.random r :: IO Word32
        xD <- RM.random r :: IO Word32
        xE <- RM.random r :: IO Word32
        xF <- RM.random r :: IO Word32
        c0 <- RM.random r :: IO Word32
        c1 <- RM.random r :: IO Word32
        c2 <- RM.random r :: IO Word32
        c3 <- RM.random r :: IO Word32
        c4 <- RM.random r :: IO Word32
        c5 <- RM.random r :: IO Word32
        c6 <- RM.random r :: IO Word32
        c7 <- RM.random r :: IO Word32
        c8 <- RM.random r :: IO Word32
        c9 <- RM.random r :: IO Word32
        cA <- RM.random r :: IO Word32
        cB <- RM.random r :: IO Word32
        cC <- RM.random r :: IO Word32
        cD <- RM.random r :: IO Word32
        cE <- RM.random r :: IO Word32
        cF <- RM.random r :: IO Word32
        v0 <- RM.random r :: IO Word32
        v1 <- RM.random r :: IO Word32
        v2 <- RM.random r :: IO Word32
        v3 <- RM.random r :: IO Word32
        v4 <- RM.random r :: IO Word32
        v5 <- RM.random r :: IO Word32
        v6 <- RM.random r :: IO Word32
        v7 <- RM.random r :: IO Word32
        v8 <- RM.random r :: IO Word32
        v9 <- RM.random r :: IO Word32
        vA <- RM.random r :: IO Word32
        vB <- RM.random r :: IO Word32
        vC <- RM.random r :: IO Word32
        vD <- RM.random r :: IO Word32
        vE <- RM.random r :: IO Word32
        vF <- RM.random r :: IO Word32
        b0 <- RM.random r :: IO Word32
        b1 <- RM.random r :: IO Word32
        b2 <- RM.random r :: IO Word32
        b3 <- RM.random r :: IO Word32
        b4 <- RM.random r :: IO Word32
        b5 <- RM.random r :: IO Word32
        b6 <- RM.random r :: IO Word32
        b7 <- RM.random r :: IO Word32
        b8 <- RM.random r :: IO Word32
        b9 <- RM.random r :: IO Word32
        bA <- RM.random r :: IO Word32
        bB <- RM.random r :: IO Word32
        bC <- RM.random r :: IO Word32
        bD <- RM.random r :: IO Word32
        bE <- RM.random r :: IO Word32
        bF <- RM.random r :: IO Word32
        BL.hPutStr SI.stdout  $ DBP.runPut $ do
            DBP.putWord32be x0
            DBP.putWord32be x1
            DBP.putWord32be x2
            DBP.putWord32be x3
            DBP.putWord32be x4
            DBP.putWord32be x5
            DBP.putWord32be x6
            DBP.putWord32be x7
            DBP.putWord32be x8
            DBP.putWord32be x9
            DBP.putWord32be xA
            DBP.putWord32be xB
            DBP.putWord32be xC
            DBP.putWord32be xD
            DBP.putWord32be xE
            DBP.putWord32be xF
            DBP.putWord32be c0
            DBP.putWord32be c1
            DBP.putWord32be c2
            DBP.putWord32be c3
            DBP.putWord32be c4
            DBP.putWord32be c5
            DBP.putWord32be c6
            DBP.putWord32be c7
            DBP.putWord32be c8
            DBP.putWord32be c9
            DBP.putWord32be cA
            DBP.putWord32be cB
            DBP.putWord32be cC
            DBP.putWord32be cD
            DBP.putWord32be cE
            DBP.putWord32be cF
            DBP.putWord32be v0
            DBP.putWord32be v1
            DBP.putWord32be v2
            DBP.putWord32be v3
            DBP.putWord32be v4
            DBP.putWord32be v5
            DBP.putWord32be v6
            DBP.putWord32be v7
            DBP.putWord32be v8
            DBP.putWord32be v9
            DBP.putWord32be vA
            DBP.putWord32be vB
            DBP.putWord32be vC
            DBP.putWord32be vD
            DBP.putWord32be vE
            DBP.putWord32be vF
            DBP.putWord32be b0
            DBP.putWord32be b1
            DBP.putWord32be b2
            DBP.putWord32be b3
            DBP.putWord32be b4
            DBP.putWord32be b5
            DBP.putWord32be b6
            DBP.putWord32be b7
            DBP.putWord32be b8
            DBP.putWord32be b9
            DBP.putWord32be bA
            DBP.putWord32be bB
            DBP.putWord32be bC
            DBP.putWord32be bD
            DBP.putWord32be bE
            DBP.putWord32be bF

这段代码在我的电脑上每秒输出大约6兆字节的随机字节。

快速模式下,输出速度为每秒约150兆字节。

如果我将快速模式中变量的数量从64个减少到16个,则其速度将降至每秒约78兆字节。

如何在不减慢速度的情况下使代码更紧凑和简洁?


  1. 如果您在慢代码中将“Word8”替换为“Word32”(使其与快速代码相同),会发生什么?
  2. 如果您使用循环填充一个包含64个随机数的列表,并在此之后执行“runPut”,那么它是否比使用64个变量慢得多?
  3. 这个问题更适合在Codereview.SE上提问。
- fjarri
  1. Data.Bytestring.Lazy.pack 需要 Word8 的列表;将 Word32 更改为 Word64 可能会影响快速变体。
  2. 在这里使用什么循环?map ... [0..63]sequence $ replicate 64 $ do ...?这些东西往往会生成垃圾,需要由 GC 收集...
  3. 我考虑过 CodeReview,但问题更多地是关于如何高效地在 Haskell 中执行操作,而不是关于我编写的用于调试其他程序缓慢性能的特定代码。
- Vi.
1
  1. 我只是说请求从RNG获取4个Word8与请求1个Word32相比损失多少是不确定的。
  2. 当你有100Mb/s的IO时,你过于担心垃圾收集器了。我的想法是“runPut”和“pack”之间的差异可能至关重要。
  3. 它也帮助我解决了类似的Haskell问题。它是关于解释基于你编写的特定代码的抽象习语。
- fjarri
@Bogdan:“当你有100Mb/s的IO时,你太担心垃圾收集器了。”-> 快速代码显示%GC time 9.9%,慢速代码显示%GC time 37.6% - Vi.
我在更新的答案中有一个速度为300+ MB/s的版本:由于使用了“Foreign”,所以不太习惯,但相当简洁。 - rkhayrov
显示剩余2条评论
2个回答

9

我认为在Haskell中,懒惰IO现在不被视为非常符合惯用语。它可能适用于一行代码,但对于大型IO密集型程序,Haskeller使用迭代器/导管/管道/Oleg-knows-what。

首先,为了做一个参考点,我在我的计算机上运行了您的原始版本的一些统计数据,使用GHC 7.6.3 (-O2 --make)编译,在Linux x86-64上。慢的惰性字节串版本:

$ ./rnd +RTS -s | pv | head -c 100M > /dev/null
 100MB 0:00:09 [10,4MB/s] [         <=>                                       ]
   6,843,934,360 bytes allocated in the heap
       2,065,144 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.2% of total user, 97.7% of total elapsed

它并不是非常快,但没有GC和内存开销。有趣的是,您如何使用此代码获得了37%的GC时间。

使用展开循环的快速版本:

$ ./rndfast +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:04 [ 110MB/s] [    <=>                                            ]
  69,434,953,224 bytes allocated in the heap
       9,225,128 bytes copied during GC
          68,000 bytes maximum residency (2 sample(s))
          18,016 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  85.0% of total user, 72.7% of total elapsed

这样速度更快了,但有趣的是,现在我们有15%的垃圾回收开销。

最后,我的版本使用conduits和blaze-builder。它一次生成512个随机的Word64来产生4 KB的数据块以被下游使用。当我将列表“缓冲区”大小从32增加到512时,性能稳步提高,但在128以上改进较小。

import Blaze.ByteString.Builder (Builder)
import Blaze.ByteString.Builder.Word
import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Conduit.Blaze (builderToByteString)
import Data.Word
import System.IO (stdout)
import qualified System.Random.Mersenne as RM

randomStream :: RM.MTGen -> Source IO Builder
randomStream gen = forever $ do
    words <- liftIO $ RM.randoms gen
    yield $ fromWord64shost $ take 512 words

main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    randomStream gen $= builderToByteString $$ CB.sinkHandle stdout

我注意到,与上述两个程序不同的是,在使用-fllvm编译时,它稍微快了一点(3-4%),因此下面的输出是由LLVM 3.3生成的二进制文件。

$ ./rndconduit +RTS -s | pv | head -c 500M > /dev/null
 500MB 0:00:09 [53,2MB/s] [         <=>                                       ]
   8,889,236,736 bytes allocated in the heap
      10,912,024 bytes copied during GC
          36,376 bytes maximum residency (2 sample(s))
          19,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  99.0% of total user, 91.9% of total elapsed

因此,它的速度是手动展开版本的两倍慢,但几乎与惰性IO版本一样短且易读,几乎没有GC开销并具有可预测的内存行为。也许这里还有改进的空间:欢迎评论。

更新:

通过结合一些不安全的字节操作和conduits,我成功地创建了一个程序,可以生成每秒300多MB的随机数据。看起来,简单的类型专用尾递归函数比惰性列表和手动展开都要好。

import Control.Monad (forever)
import Control.Monad.IO.Class (liftIO)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import Data.Word
import Foreign.Marshal.Array
import Foreign.Ptr
import Foreign.Storable
import System.IO (stdout)
import qualified System.Random.Mersenne as RM


randomChunk :: RM.MTGen -> Int -> IO ByteString
randomChunk gen bufsize = allocaArray bufsize $ \ptr -> do
    loop ptr bufsize
    B.packCStringLen (castPtr ptr, bufsize * sizeOf (undefined :: Word64))
    where
    loop :: Ptr Word64 -> Int -> IO ()
    loop ptr 0 = return ()
    loop ptr n = do
        x <- RM.random gen
        pokeElemOff ptr n x
        loop ptr (n - 1)


chunkStream :: RM.MTGen -> Source IO ByteString
chunkStream gen = forever $ liftIO (randomChunk gen 512) >>= yield


main :: IO ()
main = do
    gen <- RM.newMTGen Nothing
    chunkStream gen $$ CB.sinkHandle stdout

在这个速度下,I/O开销实际上变得明显:程序在系统调用中花费了四分之一以上的运行时间,并且像上面的示例中添加head到管道会显著减慢它的速度。

$ ./rndcond +RTS -s | pv > /dev/null
^C27GB 0:00:10 [ 338MB/s] [         <=>                                       ]
   8,708,628,512 bytes allocated in the heap
       1,646,536 bytes copied during GC
          36,168 bytes maximum residency (2 sample(s))
          17,080 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
  ...
  Productivity  98.7% of total user, 73.6% of total elapsed

2
我可以确认第二个版本比第一个版本慢,但程度不同。在10秒钟内,短代码生成了111M的数据,而大代码生成了833M的数据。这是在Mac OSX Lion上完成的,使用7.6.3编译并带有-O3。
虽然我不知道第一个解决方案为什么如此缓慢,但第二个解决方案可以通过使用“replicateM”和“mapM”来简化以消除重复。
main3 = do
    r <- RM.newMTGen  Nothing :: IO RM.MTGen
    forever $ do
        vals <- sequence $ replicate 64 (RM.random r)
        BL.hPutStr SI.stdout $ DBP.runPut $ mapM_ DBP.putWord32be vals

尽管如此,这种解决方案仍然较慢,在10秒钟内生成了492M的数据。最后一招是使用模板Haskell来生成展开循环的代码:

main4 = do
  r <- RM.newMTGen Nothing :: IO RM.MTGen
  forever $ do
    $(let varCount = 64
          -- | replaces every instance of oldName with newName in the exp
          replaceNames :: (Typeable t, Data t) => String -> Name -> t -> t
          replaceNames oldName replacementName expr = everywhere (mkT changeName) expr where
              changeName name | nameBase name == oldName = replacementName
                              | otherwise       = name
          singleVarExp :: Name -> ExpQ -> ExpQ
          singleVarExp varName next = replaceNames "patternvar" varName <$> [| RM.random r >>= \patternvar -> $(next) |]
          allVarExps :: [Name] -> ExpQ -> ExpQ
          allVarExps (n:ns) next = foldr (\var result -> singleVarExp var result)
                                         (singleVarExp n next) ns

          singleOutputter :: Name -> ExpQ -> ExpQ
          singleOutputter varName next = [| DBP.putWord32be $(varE varName) >> $(next) |]
          allVarOutput :: [Name] -> ExpQ
          allVarOutput (n:ns) = foldr (\var result -> singleOutputter var result)
                                      (singleOutputter n [| return () |]) ns
          printResultExp :: [Name] -> ExpQ
          printResultExp names = [| BL.hPutStr SI.stdout $ DBP.runPut ($(allVarOutput names)) |]

          result = do
            vars <- replicateM varCount $ newName "x"
            allVarExps vars (printResultExp vars)
      in result)

这个版本的运行速度基本与你原来的快速版本相同。虽然它不够整洁(你的快速解决方案更易读),但你现在可以轻松地更改变量数量,并且仍然可以展开循环。我尝试过512个变量,但除了使编译时间变得巨大之外,似乎对性能没有太大影响。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接