为什么我的函数的 pointfree 版本会使用更多的内存？

Question

为什么我的函数的 pointfree 版本会使用更多的内存？

16

我正在解决一个欧拉计划问题，最终得到了一个包含以下函数的Haskell文件：

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr (\(cs', n) a -> fromBool (f cs cs') * n + a) 0

通过从 Foreign.Marshal.Utils 导入的 fromBool，可以快速将 True 转换为 1，False 转换为 0。

我试图让解决方案变得更快一些，所以我尝试从 foldr 切换到 foldl'（在过程中切换参数），因为我认为在数字上使用 foldr 没有太多意义。

根据 GHC 的分析工具，从 foldr 切换到 foldl' 导致我分配的内存增加了两倍以上。

为了好玩，我还决定用函数的点式版本来替换 lambda 表达式：

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f cs = foldr ((+) . uncurry ((*) . fromBool . f cs)) 0

这导致我的内存分配比foldr版本增加了20倍。

现在即使在20倍情况下，总内存分配量也只有约 135Mb，而程序的运行时间相对不受影响，如果有的话，更高内存分配版本甚至会略微快一些。

但我真的很好奇这些结果是如何可能的，因此将来我能够在没有太多余地的情况下选择 "正确" 的函数。

编辑：

GHC版本为7.10.2，编译时使用了-O2 -prof -fprof-auto。用+RTS -p执行。

编辑2：

好吧，看起来这太难再现了，无法省略其余代码，那么这就是整个程序：

以下包含剧透：

{-# LANGUAGE NoMonomorphismRestriction #-}

import Control.Monad
import Data.List
import Foreign.Marshal.Utils

data Color = Red | Green | Blue deriving (Eq, Enum, Bounded, Show)

colors :: [Color]
colors = [Red ..]

matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
matches f x = foldr ((+) . uncurry ((*) . fromBool . f x)) 0
-- matches f x = foldr (\(y, n) a -> fromBool (f x y) * n + a) 0
-- matches f x = foldl' (\a (y, n) -> fromBool (f x y) * n + a) 0

invert :: [([Color], Int)] -> [([Color], Int)]
invert rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM len colors
    valid (x : xs) (y : ys) = x /= y && valid xs ys
    valid _ _ = True

expand :: [([Color], Int)] -> [([Color], Int)]
expand rs = (\cs -> (cs, matches valid cs rs)) <$> choices
  where
    len = maximum $ length . fst <$> rs
    choices = replicateM (len + 1) colors
    valid (x1 : x2 : xs) (y : ys) = x1 /= y && x2 /= y && valid (x2 : xs) ys
    valid _ _ = True

getRow :: Int -> [([Color], Int)]
getRow 1 = flip (,) 1 . pure <$> colors
getRow n = expand . invert $ getRow (n - 1)

result :: Int -> Int
result n = sum $ snd <$> getRow n

main :: IO ()
main = print $ result 8

- semicolon

1

顺便问一下：cs 列表本身是什么，你的 f 又是什么？ - Random Dev

1

为什么要使用 fromBool 而不是直接使用 fromEnum？ - dfeuer

1

虽然我有一个解释，但是完全复制你的行为真的很困难。罪魁祸首是 uncurry 和一些优化，这是肯定的，但为了提供完整的答案，一些示例函数/值会非常有帮助。 - Zeta

1

Grah. 这太荒谬了。如果你使用 a ~ Int， GHC 会创建一个具有相同分配行为的无点变体，除非你调用函数两次. 如果你使用 a ~ Bool，无点函数总是导致分配行为。所以，长话短说，要精确地复制你的行为是很麻烦的。请至少添加 a 的使用类型。 - Zeta

1

@dfeuer，现在我感觉很蠢，但是在我搜索“Bool -> Int”时，fromBool出现在hoogle中fromEnum之前，这是我的辩护。 - semicolon

显示剩余8条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Zeta · Accepted Answer

^{注意: 这篇文章是用文学 Haskell 写成的。将其复制到文件中，保存为 *.lhs 文件，并在 GHC(i) 中编译/加载。此外，我在你修改代码之前开始撰写本答案，但教训始终不变。}

概括

Prelude 函数 uncurry 太懒了，而你的模式匹配正好足够严格。

注意事项和免责声明

我们正在进入一个神奇、奇怪的地方。请小心。此外，我的核心能力很基础。既然我已经失去了所有的信誉，那么让我们开始吧。

测试过的代码

为了知道我们为什么需要额外的内存需求，有多个函数是有用的。

> import Control.Monad (forM_)

这是您的原始非 pointfree 变体：

> matches :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matches    f cs = foldr (\(cs', n) a -> fromEnum (f cs cs') * n + a) 0

这是一个稍微有些点-free的变体，参数a被 eta-reduced。

> matchesPF' :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF' f cs = foldr (\(cs', n) -> (+) (fromEnum (f cs cs') * n)) 0

这是一种手动内联 uncurry 的变体。

> matchesPFI :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFI f cs = foldr ((+) . (\(cs', n) -> fromEnum (f cs cs') * n)) 0

这是您的无参版本。

> matchesPF :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPF  f cs = foldr ((+) . uncurry  ((*) . fromEnum . f cs)) 0

这是一个使用自定义 uncurry 的变体，请参见下面。

> matchesPFU :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFU f cs = foldr ((+) . uncurryI ((*) . fromEnum . f cs)) 0

这是一种使用自定义惰性uncurry的变体，见下文。

> matchesPFL :: (a -> a -> Bool) -> a -> [(a, Int)] -> Int
> matchesPFL f cs = foldr ((+) . uncurryL ((*) . fromEnum . f cs)) 0

为了方便测试功能，我们使用一个列表：

> funcs = [matches, matchesPF', matchesPF, matchesPFL, matchesPFU, matchesPFI]

我们自己编写的 uncurry 函数：

> uncurryI :: (a -> b -> c) -> (a, b) -> c
> uncurryI f (a,b) = f a b

一个懒惰的uncurry:

> uncurryL :: (a -> b -> c) -> (a, b) -> c
> uncurryL f p = f (fst p) (snd p)

懒惰的变体 uncurryL 与 Prelude 中的变体具有相同的语义，例如。

uncurry (\_ _ -> 0) undefined == 0 == uncurryL (\_ _ -> 0) undefined

uncurryI 严格要求元组的结构。

> main = do
>   let f a b = a < b
>   forM_ [1..10] $ \i ->
>     forM_ funcs $ \m ->
>       print $ m f i (zip (cycle [1..10]) [1..i*100000])

这个列表[1..i*100000]是有意取决于i的，以便我们不引入CAF并扭曲我们的分配概况。

展开后的代码

在我们深入了解概况之前，让我们先看看每个函数的展开后的代码：

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 221, types: 419, coercions: 0}

uncurryL
uncurryL = \ @ a @ b @ c f p -> f (fst p) (snd p)

uncurryI
uncurryI = \ @ a @ b @ c f ds -> case ds of _ { (a, b) -> f a b }

-- uncurried inlined by hand
matchesPFI =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (\ ds ->
            case ds of _ { (cs', n) ->
            * $fNumInt (fromEnum $fEnumBool (f cs cs')) n
            }))
      (I# 0)

-- lazy uncurry
matchesPFL =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryL (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- stricter uncurry
matchesPFU =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurryI (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- normal uncurry
matchesPF =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (. (+ $fNumInt)
         (uncurry (. (* $fNumInt) (. (fromEnum $fEnumBool) (f cs)))))
      (I# 0)

-- eta-reduced a
matchesPF' =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n)
         })
      (I# 0)

-- non-point-free
matches =
  \ @ a f cs ->
    foldr
      $fFoldable[]
      (\ ds a ->
         case ds of _ { (cs', n) ->
         + $fNumInt (* $fNumInt (fromEnum $fEnumBool (f cs cs')) n) a
         })
      (I# 0)

到目前为止，一切似乎都很好。没有什么令人惊讶的事情发生。类型类函数被它们的字典变体替换，例如foldr变成foldr $ f Foldable []，因为我们在列表上调用它。

配置文件

2016年7月18日15:47 时间和分配的性能报告 (最终版)
Prof +RTS -s -p -RTS
总时间 = 1.45秒(1446个时钟周期@ 1000微秒,1个处理器)
总分配= 1,144,197,200字节(不包括分析开销)
成本中心   模块     %时间 %分配
matchesPF'  Main   13.6   0.0
matchesPF   Main   13.3   11.5
main.\.\    Main   11.8   76.9
main.f      Main   10.9   0.0
uncurryL    Main   9.5    11.5
matchesPFU  Main   8.9    0.0
matchesPFI  Main   7.3    0.0
matches     Main   6.9    0.0
matchesPFL  Main   6.3    0.0
uncurryI    Main   5.3    0.0
matchesPF'.\Main   2.6    0.0
matchesPFI.\Main   2.0    0.0
matches.\   Main   1.5    0.0
单独      继承
成本中心          模块                 记录数   条目     %时间 %分配  %时间 %分配
MAIN            MAIN                  44         0     0.0    0.0  100.0 100.0
 main           Main                  89         0     0.0    0.0  100.0 100.0
  main.\        Main                  90        10     0.0    0.0  100.0 100.0
   main.\.\     Main                  92        60     11.8   76.9  100.0 100.0
    funcs       Main                  93         0     0.0    0.0   88.2 23.1
     matchesPFI Main                  110       10     7.3    0.0   11.7 0.0
      matchesPFI.\Main               111  5500000     2.0    0.0   4.4  0.0
       main.f   Main                  112  5500000     2.4    0.0   2.4  0.0
     matchesPFU Main                  107       10     8.9    0.0   15.3 0.0
      uncurryI  Main                  108  5500000     5.3    0.0   6.4  0.0
       main.f   Main                  109  5500000     1.1    0.0   1.1  0.0
     matchesPFL Main                  104       10     6.3    0.0   17.7 11.5
      uncurryL  Main                  105  5500000     9.5    11.5  11.4 11.5
       main.f   Main                  106  5500000     1.9    0.0   1.9  0.0
     matchesPF  Main                  102       10     13.3   11.5  15.4 11.5
      main.f    Main                  103  5500000     2.1    0.0   2.1  0.0
     matchesPF' Main                  99        10     13.6   0.0   17.2 0.0
      matchesPF'.\Main               100  5500000     2.6    0.0   3.6  0.0
       main.f   Main                  101  5500000     1.0    0.0   1.0  0.0
     matches    Main                  94        10     6.9    0.0  忽略main\.\.的噪音，那只是列表。不过，有一点需要立即注意：matchesPF和uncurryL使用相同的alloc％：
matchesPF    Main       13.3   11.5
uncurryL     Main        9.5   11.5
深入了解 CORE
现在是时候检查生成的 CORE 程序（ghc -ddump-simpl）了。我们会发现大部分函数已经被转换为 worker wrappers，并且它们看起来都差不多（-dsuppress-all -dsuppress-uniques）：
$wa5
$wa5 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case y of _ { (cs', n) ->
              case $wgo ys of ww { __DEFAULT ->
              case w w1 cs' of _ {
                False -> case n of _ { I# y1 -> ww };
                True -> case n of _ { I# y1 -> +# y1 ww }
              }
              }
              }
          }; } in
    $wgo w2
这是您通常的工人封装。 $wgo接受一个列表，检查它是否为空，在头部(case y of_ {(cs'，n) ->…)中严格，在递归结果$wgo ys of ww中则是惰性的。

所有函数看起来都一样。嗯，除了 matchesPF(您的变量)
-- matchesPF
$wa3 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case let {
                     x = case y of _ { (x1, ds) -> x1 } } in
                   case w w1 x of _ {
                     False ->
                       case y of _ { (ds, y1) -> case y1 of _ { I# y2 -> main13 } };
                              -- main13 is just #I 0
                     True -> case y of _ { (ds, y1) -> y1 }
                   }
              of _ { I# x ->
              +# x ww
              }
              }
          }; } in
    $wgo w2
并且 matchesPFL （使用惰性的 uncurryL 变体）
-- matchesPFL
$wa2
$wa2 =
  \ @ a1 w w1 w2 ->
    letrec {
      $wgo =
        \ w3 ->
          case w3 of _ {
            [] -> 0;
            : y ys ->
              case $wgo ys of ww { __DEFAULT ->
              case snd y of ww1 { I# ww2 ->
              case let {
                     x = fst y } in
                   case w w1 x of _ {
                     False -> main13;
                     True -> ww1
                   }
              of _ { I# x ->
              +# x ww
              }
              }
              }
          }; } in
    $wgo w2
它们基本上是相同的。 它们都包含let绑定。这将创建一个thunk并通常导致更糟糕的空间要求。

解决方案

我认为此时罪魁祸首很明显。 那就是uncurry。 GHC想要强制执行正确的语义。
uncurry (const (const 0)) undefined
不过，这会增加懒惰和额外的thunk。你的非pointfree变体没有引入那种行为，因为你在对成对进行模式匹配：
。foldr (\(cs', n) a -> …)
还是不相信我？使用惰性模式匹配
foldr (\ ~(cs', n) a -> …)
你会注意到matches与matchesPF的行为相同。因此，使用稍微严格一些的uncurry变体。uncurryI足以给严格性分析器提供提示。

请注意，对于这种行为，成对出现的元素是臭名昭著的。《RWH》花了一个章节的时间来优化单个函数的行为，其中中间的对导致了问题。