I'm trying to process a very large Unicode text file (6 GB+). What I want is to count the frequency of each unique word. I use a strict Data.Map to keep track of the count of each word while I traverse the file.
The process takes too much time and too much memory (20 GB+). I suspect the Map gets big, but I'm not sure it should reach 5x the size of the file!
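In spirit, the counting is just a strict left fold over the words. A minimal sketch of that idea (my actual code, which uses fromListWith instead, is below):

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')
import Data.Text.Lazy (Text)

-- One strict insert per word; Data.Map.Strict forces the new
-- count, so the values in the map stay evaluated.
wordCounts :: [Text] -> M.Map Text Int
wordCounts = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty
```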
Here is the code. Note that I've tried the following:
- Using Data.HashMap.Strict instead of Data.Map.Strict. Data.Map seems to perform better, in that its memory consumption grows more slowly.
- Reading the file as a lazy ByteString instead of lazy Text. I then decode it to Text, do some processing, and encode it back to ByteString for IO (see the sketch after the code below).
```haskell
import Data.Text.Lazy (Text(..), cons, pack, append)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TI
import Data.Map.Strict hiding (foldr, map, foldl')
import System.Environment
import System.IO
import Data.Word

-- Build the frequency map: pair every word with 1, merge with (+).
dictionate :: [Text] -> Map Text Word16
dictionate = fromListWith (+) . (`zip` [1,1..])

main = do
  [file,out] <- getArgs
  h <- openFile file ReadMode
  hO <- openFile out WriteMode
  mapM_ (flip hSetEncoding utf8) [h,hO]
  txt <- TI.hGetContents h
  -- Write one "word<TAB>count" line per map entry.
  TI.hPutStr hO . T.unlines .
    map (uncurry ((. cons '\t' . pack . show) . append)) .
    toList . dictionate . T.words $ txt
  hFlush hO
  mapM_ hClose [h,hO]
  print "success"
```
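The ByteString variant from the second bullet reads the input as lazy bytes and converts only at the boundaries. A minimal sketch of that shape, using decodeUtf8/encodeUtf8 from Data.Text.Lazy.Encoding (the word splitting here stands in for the real processing):

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (decodeUtf8, encodeUtf8)

-- Lazy bytes in, Text for the processing, lazy bytes out.
-- Note: decodeUtf8 throws on invalid UTF-8 input.
main :: IO ()
main = do
  bytes <- BL.getContents
  let processed = T.unlines . T.words . decodeUtf8 $ bytes
  BL.putStr (encodeUtf8 processed)
```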
There is a tool called bytestring-trie that might be of some benefit to you. – J. Abrahamson
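For context on that comment: a word count keyed by strict ByteStrings in a trie could look roughly like the sketch below. It assumes the Data.Trie / Data.Trie.Convenience API from the bytestring-trie package, and it splits words at the byte level, which is not Unicode-aware:

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Trie as Trie
import Data.Trie.Convenience (insertWith)
import Data.List (foldl')

-- Tries share key prefixes, which can shrink key storage compared
-- to keeping every word as a separate Map key.
countWords :: [BL.ByteString] -> Trie.Trie Int
countWords = foldl' (\t w -> insertWith (+) (BL.toStrict w) 1 t) Trie.empty

main :: IO ()
main = do
  txt <- BLC.getContents               -- lazy, chunk-by-chunk reading
  mapM_ print . Trie.toList . countWords . BLC.words $ txt
```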