I have written a file indexing program that is supposed to read many thousands of text file lines as records and finally group those records by fingerprint. It uses Data.List.Split.splitOn to split the lines at tabs and retrieve the record fields. The program consumes 10-20 GB of memory.

There is probably not much I can do to reduce that huge memory footprint, but I cannot explain why a function like splitOn (breakDelim) can consume that much memory:

Mon Dec 9 21:07 2019 Time and Allocation Profiling Report (Final)
group +RTS -p -RTS file1 file2 -o 2 -h
total time = 7.40 secs (7399 ticks @ 1000 us, 1 processor)
total alloc = 14,324,828,696 bytes (excludes profiling overheads)
COST CENTRE                           MODULE                     SRC                                                %time %alloc

fileToPairs.linesIncludingEmptyLines  ImageFileRecordParser      ImageFileRecordParser.hs:35:7-47                    25.0   33.8
breakDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(151,1)-(156,36)   24.9   39.3
sortAndGroup                          Aggregations               Aggregations.hs:6:1-85                              12.9    1.7
fileToPairs                           ImageFileRecordParser      ImageFileRecordParser.hs:(33,1)-(42,14)              8.2   10.7
matchDelim                            Data.List.Split.Internals  src/Data/List/Split/Internals.hs:(73,1)-(77,23)      7.4    0.4
onSublist                             Data.List.Split.Internals  src/Data/List/Split/Internals.hs:278:1-72            3.6    0.0
toHashesView                          ImageFileRecordStatistics  ImageFileRecordStatistics.hs:(48,1)-(51,24)          3.0    6.3
main                                  Main                       group.hs:(47,1)-(89,54)                              2.9    0.4
numberOfUnique                        ImageFileRecord            ImageFileRecord.hs:37:1-40                           1.6    0.1
toHashesView.sortedLines              ImageFileRecordStatistics  ImageFileRecordStatistics.hs:50:7-30                 1.4    0.1
imageFileRecordFromFields             ImageFileRecordParser      ImageFileRecordParser.hs:(11,1)-(30,5)               1.1    0.3
toHashView                            ImageFileRecord            ImageFileRecord.hs:(67,1)-(69,23)                    0.7    1.7
Or is it that the type [Char] is simply so memory-inefficient compared to Text that this alone makes splitOn consume that much memory?
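For comparison, here is a minimal sketch (my own assumption of what the switch could look like, not the program's actual code) of splitting tab-separated lines with Data.Text instead of String; Text stores characters in a packed array rather than one cons cell per character:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Split every line of a file at tabs using the packed Text type.
-- For a non-empty delimiter such as "\t", T.splitOn behaves like
-- Data.List.Split.splitOn, but without per-character list cells.
fieldsOfFile :: FilePath -> IO [[T.Text]]
fieldsOfFile path = do
  contents <- TIO.readFile path
  return (map (T.splitOn "\t") (T.lines contents))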
Update 1 (user HTNW suggested running with +RTS -s)
  23,446,268,504 bytes allocated in the heap
  10,753,363,408 bytes copied during GC
   1,456,588,656 bytes maximum residency (22 sample(s))
      29,282,936 bytes maximum slop
            3620 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     45646 colls,     0 par    4.055s   4.059s     0.0001s    0.0013s
  Gen  1        22 colls,     0 par    4.034s   4.035s     0.1834s    1.1491s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.477s  (  7.475s elapsed)
  GC      time    8.089s  (  8.094s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.114s  (  0.114s elapsed)
  Total   time   15.687s  ( 15.683s elapsed)

  %GC     time      51.6%  (51.6% elapsed)

  Alloc rate    3,135,625,407 bytes per MUT second

  Productivity  48.4% of total user, 48.4% of total elapsed
The text files processed here are smaller than usual (37 MB, UTF-8 encoded), yet the program still uses 3 GB of memory.
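A rough back-of-the-envelope check (my own, assuming a 64-bit GHC and mostly ASCII input) already puts a single String copy of such a file in the same ballpark as the measured maximum residency:

-- On 64-bit GHC a (:) list cell is 3 machine words = 24 bytes, and Char
-- values in the ASCII range are shared from a static table, so roughly
-- 24 bytes per character is the lower bound for [Char].
bytesPerStringChar :: Int
bytesPerStringChar = 3 * 8

-- One fully evaluated [Char] copy of a 37 MB (mostly 1-byte UTF-8) file:
-- 37,000,000 * 24 ≈ 888 MB. A couple of simultaneously live copies
-- (e.g. the input lines plus splitOn's output) lands in the GB range
-- measured above.
oneStringCopy :: Int
oneStringCopy = 37 * 1000 * 1000 * bytesPerStringChar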
Update 2 (critical pieces of the code)
Explanation: fileToPairs processes a text file. It returns a list of key-value pairs (key: the fingerprint of a record, value: the record itself).
sortAndGroup associations = Map.fromListWith (++) [(k, [v]) | (k, v) <- associations]

main = do
  CommandLineArguments{..} <- cmdArgs $ CommandLineArguments {
      ignored_paths_file = def &= typFile,
      files = def &= typ "FILES" &= args,
      number_of_occurrences = def &= name "o",
      minimum_number_of_occurrences = def &= name "l",
      maximum_number_of_occurrences = def &= name "u",
      number_of_hashes = def &= name "n",
      having_record_errors = def &= name "e",
      hashes = def
    }
    &= summary "Group image/video files"
    &= program "group"

  let ignoredPathsFilenameMaybe = ignored_paths_file
  let filenames = files
  let hashesMaybe = hashes

  ignoredPaths <- case ignoredPathsFilenameMaybe of
    Just ignoredPathsFilename -> ioToLines (readFile ignoredPathsFilename)
    _ -> return []

  recordPairs <- mapM (fileToPairs ignoredPaths) filenames
  let allRecordPairs = concat recordPairs
  let groupMap = sortAndGroup allRecordPairs

  let statisticsPairs = map toPair (Map.toList groupMap)
        where toPair item = (fst item, imageFileRecordStatisticsFromRecords . snd $ item)

  let filterArguments = FilterArguments {
          numberOfOccurrencesMaybe = number_of_occurrences,
          minimumNumberOfOccurrencesMaybe = minimum_number_of_occurrences,
          maximumNumberOfOccurrencesMaybe = maximum_number_of_occurrences,
          numberOfHashesMaybe = number_of_hashes,
          havingRecordErrorsMaybe = having_record_errors
        }
  let filteredPairs = filterImageRecords filterArguments statisticsPairs
  let filteredMap = Map.fromList filteredPairs

  case hashesMaybe of
    Just True -> mapM_ putStrLn (map toHashesView (map snd filteredPairs))
    _         -> Char8.putStrLn (encodePretty filteredMap)
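As an aside on the sortAndGroup definition above: Data.Map's fromListWith combines the values of duplicate keys, and with the lazy Data.Map API each combined list is kept as an unevaluated (++) thunk until the map is consumed, which itself retains memory. A small self-contained illustration of the grouping semantics (toy data, my own example; whether the post imports the lazy or strict module is not shown):

import qualified Data.Map.Strict as Map  -- strict variant, an assumption

-- fromListWith applies the combining function as (new ++ old),
-- so later occurrences of a key end up in front:
groupedDemo :: Map.Map String [Int]
groupedDemo = Map.fromListWith (++) [("a", [1]), ("b", [2]), ("a", [3])]
-- == Map.fromList [("a", [3, 1]), ("b", [2])]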
[Char] is inefficient; it should be avoided whenever possible. – user1198582

[Char] uses at least 24 bytes per character! Use Data.Text. – luqui

total alloc is not the amount of memory actually used. While keeping it low is good, the actual peak memory is given by the "maximum residency" figure in +RTS -s; total allocation is more a measure of "work", and most of it is temporary objects. In any case, seeing the actual code is necessary here. Also, careless/automated profiling can block important optimizations; +RTS -s works without profiling, so check that too (it does require recompiling with profiling off, of course). Going by @luqui's figure, 100,000 lines of 1,000 characters each would still be under 3 GB even in normal form as [Char], which may mean something is off. – HTNW
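Spelling out HTNW's estimate with luqui's 24-bytes-per-character lower bound (my own arithmetic, not part of the original comments):

-- A fully evaluated [Char] holding 100,000 lines of 1,000 characters:
--   100,000 * 1,000 * 24 bytes = 2.4 * 10^9 bytes ≈ 2.4 GB, i.e. < 3 GB.
-- The input here is only 37 MB, so multi-GB residency would mean that
-- several copies or large intermediate structures stay alive at once.
htnwEstimateBytes :: Int
htnwEstimateBytes = 100000 * 1000 * 24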