Which optimizations can GHC be relied upon to perform?

196

GHC has a lot of optimizations that it can perform, but I don't know what they all are, nor how likely they are to be performed and under what circumstances.

My question is: what transformations can I expect it to apply every time, or nearly every time? If I look at a piece of code that's going to be executed (evaluated) frequently and my first thought is "hmm, maybe I should optimize that", in which cases should my second thought instead be "don't even think about it, GHC has got this"?

I was reading the paper Stream Fusion: From Lists to Streams to Nothing at All, and the technique it uses of rewriting list processing into a different form which GHC's normal optimizations then reliably optimize down to simple loops was novel to me. How can I tell whether my own programs are eligible for that kind of optimization?

There is some information in the GHC manual, but it only goes part of the way towards answering the question.

EDIT: I'm starting a bounty. What I would like is a list of lower-level transformations like lambda/let/case-floating, type/constructor/function argument specialization, strictness analysis and unboxing, worker/wrapper, and whatever else GHC does, together with explanations and examples of input and output code, and ideally illustrations of situations where the total effect is more than the sum of its parts. Ideally also some mention of when transformations won't happen. I'm not expecting novel-length explanations of every transformation; a couple of sentences and an inline one-liner code example could be enough (or a link, if it's not to twenty pages of scientific paper), as long as the big picture is clear by the end of it. I want to be able to look at a piece of code and make a good guess about whether it will compile down to a tight loop, or why not, or what I would have to change to make it do so. (I'm not that interested here in big optimization frameworks like stream fusion (I just read a paper about that); more in the kind of knowledge that the people who write these frameworks have.)


13
This is a very worthwhile question to explore; writing a proper answer to it is somewhat tricky. - MathematicalOrchid
1
A really good starting point is this: http://www.aosabook.org/en/ghc.html - Gabriella Gonzalez
8
Whatever the language, if your first thought is "maybe I should optimize this", your second thought should be "I'll profile it first". - John L
4
While the kind of knowledge you're after is useful, so this is still a good question, I think you're better served by doing as little optimization as possible. Write what you mean, and only consider making the code less straightforward for the sake of performance when it's clearly needed. Rather than looking at code and thinking "that will be executed frequently, maybe I should optimize it", it should only be when you observe the code running too slowly that you think "I should find out what gets executed frequently and optimize that". - Ben
14
I fully anticipated that this part would provoke cries of "profile it!". But I suppose the flip side is: if I've profiled it and it's slow, maybe I can rewrite it, or just tweak it, into a form that's still high-level but which GHC can optimize better, instead of hand-optimizing it myself. That requires the same kind of knowledge. And if I'd had that knowledge in the first place, I could have saved myself the edit-and-profile cycle. - glaebhoerl
1
I wish this bounty could be spent on improving an answer somewhere on the Haskell wiki, where it belongs and where it could be found again. Having said that, I tend to optimize Haskell code by careful timing and by imagining what might be going on. I'm not sure how actually knowing that I'm right would help; it's the timings themselves that matter. - Syzygies
3 Answers

120

This GHC Trac page also explains the passes fairly well. This page explains the optimization ordering, though, like the majority of the Trac wiki, it is out of date.

For specifics, the best thing to do is probably to look at how a specific program is compiled. The best way to see which optimizations are being performed is to compile the program verbosely, using the -v flag. Taking as an example the first piece of Haskell I could find on my computer:

Glasgow Haskell Compiler, Version 7.4.2, stage 2 booted by GHC version 7.4.1
Using binary package database: /usr/lib/ghc-7.4.2/package.conf.d/package.cache
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-7d3c2c69a5e8257a04b2c679c40e2fa7
wired-in package integer-gmp mapped to integer-gmp-0.4.0.0-af3a28fdc4138858e0c7c5ecc2a64f43
wired-in package base mapped to base-4.5.1.0-6e4c9bdc36eeb9121f27ccbbcb62e3f3
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.7.0.0-2bd128e15c2d50997ec26a1eaf8b23bf
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
*** Chasing dependencies:
Chasing modules from: *SleepSort.hs
Stable obj: [Main]
Stable BCO: []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = Tue Oct 18 22:22:11 CDT 2011
         ms_mod = main:Main,
         ms_textual_imps = [import (implicit) Prelude, import Control.Monad,
                            import Control.Concurrent, import System.Environment]
         ms_srcimps = []
      }]
*** Deleting temp files:
Deleting: 
compile: input file SleepSort.hs
Created temporary directory: /tmp/ghc4784_0
*** Checking old interface for main:Main:
[1 of 1] Compiling Main             ( SleepSort.hs, SleepSort.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization) = 79
*** Simplifier:
Result size of Simplifier iteration=1 = 87
Result size of Simplifier iteration=2 = 93
Result size of Simplifier iteration=3 = 83
Result size of Simplifier = 83
*** Specialise:
Result size of Specialise = 83
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False}) = 95
*** Float inwards:
Result size of Float inwards = 95
*** Simplifier:
Result size of Simplifier iteration=1 = 253
Result size of Simplifier iteration=2 = 229
Result size of Simplifier = 229
*** Simplifier:
Result size of Simplifier iteration=1 = 218
Result size of Simplifier = 218
*** Simplifier:
Result size of Simplifier iteration=1 = 283
Result size of Simplifier iteration=2 = 226
Result size of Simplifier iteration=3 = 202
Result size of Simplifier = 202
*** Demand analysis:
Result size of Demand analysis = 202
*** Worker Wrapper binds:
Result size of Worker Wrapper binds = 202
*** Simplifier:
Result size of Simplifier = 202
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True}) = 210
*** Common sub-expression:
Result size of Common sub-expression = 210
*** Float inwards:
Result size of Float inwards = 210
*** Liberate case:
Result size of Liberate case = 210
*** Simplifier:
Result size of Simplifier iteration=1 = 206
Result size of Simplifier = 206
*** SpecConstr:
Result size of SpecConstr = 206
*** Simplifier:
Result size of Simplifier = 206
*** Tidy Core:
Result size of Tidy Core = 206
writeBinIface: 4 Names
writeBinIface: 28 dict entries
*** CorePrep:
Result size of CorePrep = 224
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-I.' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' 'SleepSort.o'
Upsweep completely successful.
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_0.c /tmp/ghc4784_0/ghc4784_0.s
Warning: deleting non-existent /tmp/ghc4784_0/ghc4784_0.c
link: linkables are ...
LinkableM (Sat Sep 29 20:21:02 CDT 2012) main:Main
   [DotO SleepSort.o]
Linking SleepSort ...
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.c' '-o' '/tmp/ghc4784_0/ghc4784_0.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' '/tmp/ghc4784_0/ghc4784_1.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** Linker:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-o' 'SleepSort' 'SleepSort.o' '-L/usr/lib/ghc-7.4.2/base-4.5.1.0' '-L/usr/lib/ghc-7.4.2/integer-gmp-0.4.0.0' '-L/usr/lib/ghc-7.4.2/ghc-prim-0.2.0.0' '-L/usr/lib/ghc-7.4.2' '/tmp/ghc4784_0/ghc4784_0.o' '/tmp/ghc4784_0/ghc4784_1.o' '-lHSbase-4.5.1.0' '-lHSinteger-gmp-0.4.0.0' '-lgmp' '-lHSghc-prim-0.2.0.0' '-lHSrts' '-lm' '-lrt' '-ldl' '-u' 'ghczmprim_GHCziTypes_Izh_static_info' '-u' 'ghczmprim_GHCziTypes_Czh_static_info' '-u' 'ghczmprim_GHCziTypes_Fzh_static_info' '-u' 'ghczmprim_GHCziTypes_Dzh_static_info' '-u' 'base_GHCziPtr_Ptr_static_info' '-u' 'base_GHCziWord_Wzh_static_info' '-u' 'base_GHCziInt_I8zh_static_info' '-u' 'base_GHCziInt_I16zh_static_info' '-u' 'base_GHCziInt_I32zh_static_info' '-u' 'base_GHCziInt_I64zh_static_info' '-u' 'base_GHCziWord_W8zh_static_info' '-u' 'base_GHCziWord_W16zh_static_info' '-u' 'base_GHCziWord_W32zh_static_info' '-u' 'base_GHCziWord_W64zh_static_info' '-u' 'base_GHCziStable_StablePtr_static_info' '-u' 'ghczmprim_GHCziTypes_Izh_con_info' '-u' 'ghczmprim_GHCziTypes_Czh_con_info' '-u' 'ghczmprim_GHCziTypes_Fzh_con_info' '-u' 'ghczmprim_GHCziTypes_Dzh_con_info' '-u' 'base_GHCziPtr_Ptr_con_info' '-u' 'base_GHCziPtr_FunPtr_con_info' '-u' 'base_GHCziStable_StablePtr_con_info' '-u' 'ghczmprim_GHCziTypes_False_closure' '-u' 'ghczmprim_GHCziTypes_True_closure' '-u' 'base_GHCziPack_unpackCString_closure' '-u' 'base_GHCziIOziException_stackOverflow_closure' '-u' 'base_GHCziIOziException_heapOverflow_closure' '-u' 'base_ControlziExceptionziBase_nonTermination_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnMVar_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnSTM_closure' '-u' 'base_ControlziExceptionziBase_nestedAtomically_closure' '-u' 'base_GHCziWeak_runFinalizzerBatch_closure' '-u' 'base_GHCziTopHandler_flushStdHandles_closure' '-u' 'base_GHCziTopHandler_runIO_closure' '-u' 'base_GHCziTopHandler_runNonIO_closure' '-u' 'base_GHCziConcziIO_ensureIOManagerIsRunning_closure' '-u' 'base_GHCziConcziSync_runSparks_closure' '-u' 'base_GHCziConcziSignal_runHandlers_closure'
link: done
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_1.o /tmp/ghc4784_0/ghc4784_0.s /tmp/ghc4784_0/ghc4784_0.o /tmp/ghc4784_0/ghc4784_0.c
*** Deleting temp dirs:
Deleting: /tmp/ghc4784_0

From the first *** Simplifier: to the last, all the optimization phases happen there, and we see quite a lot of them.

First of all, the simplifier runs between almost all of the phases. This makes writing many of the optimizations much easier. For example, when implementing many optimizations, they can simply create rewrite rules to propagate the changes instead of having to do it manually. The simplifier encompasses a number of simple optimizations, including inlining and fusion. The main limitations of this that I know of are that GHC refuses to inline recursive functions, and that things have to be named correctly for fusion to work.

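As a rough before/after sketch of the kind of inlining the simplifier performs (hypothetical code of mine, not taken from the compile log above):

    double :: Int -> Int
    double x = x + x

    addFour :: Int -> Int
    addFour y = double (double y)

    -- After inlining 'double' at both call sites and simplifying,
    -- GHC ends up with code equivalent to:
    addFour' :: Int -> Int
    addFour' y = let z = y + y in z + z
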
Next, we see a full list of all the optimizations performed:

  • Specialise

    The basic idea of specialization is to remove polymorphism and overloading by identifying places where the function is called and creating versions of the function that aren't polymorphic - they are specific to the types they are called with. You can also tell the compiler to do this with the SPECIALISE pragma. As an example, take a factorial function:

     fac :: (Num a, Eq a) => a -> a
     fac 0 = 1
     fac n = n * fac (n - 1)
    

    As the compiler doesn't know any properties of the multiplication that is to be used, it cannot optimize this at all. If however, it sees that it is used on an Int, it now can create a new version, differing only in the type:

     fac_Int :: Int -> Int
     fac_Int 0 = 1
     fac_Int n = n * fac_Int (n - 1)
    

    Next, rules mentioned below can fire, and you end up with something working on unboxed Ints, which is much faster than the original. Another way to look at specialisation is partial application on type class dictionaries and type variables.

    The source here has a load of notes in it.
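
    For instance, asking GHC for that specialisation explicitly could look like this (a sketch of mine, not from the original answer; the pragma goes in the module that defines fac):

     {-# SPECIALISE fac :: Int -> Int #-}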

  • Float out

    EDIT: I apparently misunderstood this before. My explanation has completely changed.

    The basic idea of this is to move computations that shouldn't be repeated out of functions. For example, suppose we had this:

     \x -> let y = expensive in x+y
    

    In the above lambda, every time the function is called, y is recomputed. A better function, which floating out produces, is

     let y = expensive in \x -> x+y
    

    To facilitate the process, other transformations may be applied. For example, this happens:

      \x -> x + f 2
      \x -> x + let f_2 = f 2 in f_2
      \x -> let f_2 = f 2 in x + f_2
      let f_2 = f 2 in \x -> x + f_2
    

    Again, repeated computation is saved.

    The source is very readable in this case.

    At the moment bindings between two adjacent lambdas are not floated. For example, this does not happen:

     \x y -> let t = x+x in ...
    

    going to

      \x -> let t = x+x in \y -> ...
    
  • Float inwards

    Quoting the source code,

    The main purpose of floatInwards is floating into branches of a case, so that we don't allocate things, save them on the stack, and then discover that they aren't needed in the chosen branch.

    As an example, suppose we had this expression:

     let x = big in
         case v of
             True -> x + 1
             False -> 0
    

    If v evaluates to False, then by allocating x, which is presumably some big thunk, we have wasted time and space. Floating inwards fixes this, producing this:

     case v of
         True -> let x = big in x + 1
         False -> let x = big in 0
    

    , which is subsequently replaced by the simplifier with

     case v of
         True -> big + 1
         False -> 0
    

    This paper, although covering other topics, gives a fairly clear introduction. Note that despite their names, floating in and floating out don't get into an infinite loop, for two reasons:

    1. Float in floats lets into case statements, while float out deals with functions.
    2. There is a fixed order of passes, so they shouldn't be alternating infinitely.
  • Demand analysis

    Demand analysis, or strictness analysis is less of a transformation and more, like the name suggests, of an information gathering pass. The compiler finds functions that always evaluate their arguments (or at least some of them), and passes those arguments using call-by-value, instead of call-by-need. Since you get to evade the overheads of thunks, this is often much faster. Many performance problems in Haskell arise from either this pass failing, or code simply not being strict enough. A simple example is the difference between using foldr, foldl, and foldl' to sum a list of integers - the first causes stack overflow, the second causes heap overflow, and the last runs fine, because of strictness. This is probably the easiest to understand and best documented of all of these. I believe that polymorphism and CPS code often defeat this.
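
    As a hedged sketch of the foldl/foldl' part of that example (my code, not the answer's; with -O, demand analysis usually makes the lazy version behave like the strict one for a direct definition like this):

     import Data.List (foldl')

     sumLazy, sumStrict :: [Int] -> Int
     sumLazy   = foldl  (+) 0   -- builds a chain of thunks unless the
                                -- strictness analyser steps in
     sumStrict = foldl' (+) 0   -- forces the accumulator at every step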

  • Worker Wrapper binds

    The basic idea of the worker/wrapper transformation is to do a tight loop on a simple structure, converting to and from that structure at the ends. For example, take this function, which calculates the factorial of a number.

     factorial :: Int -> Int
     factorial 0 = 1
     factorial n = n * factorial (n - 1)
    

    Using the definition of Int in GHC, we have

     factorial :: Int -> Int
     factorial (I# 0#) = I# 1#
     factorial (I# n#) = I# (n# *# case factorial (I# (n# -# 1#)) of
         I# down# -> down#)
    

    Notice how the code is covered in I#s? We can remove them by doing this:

     factorial :: Int -> Int
     factorial (I# n#) = I# (factorial# n#)
    
     factorial# :: Int# -> Int#
     factorial# 0# = 1#
     factorial# n# = n# *# factorial# (n# -# 1#)
    

    Although this specific example could have also been done by SpecConstr, the worker/wrapper transformation is very general in the things it can do.

  • Common sub-expression

    This is another really simple optimization that is very effective, like strictness analysis. The basic idea is that if you have two expressions that are the same, they will have the same value. For example, if fib is a Fibonacci number calculator, CSE will transform

     fib x + fib x
    

    into

     let fib_x = fib x in fib_x + fib_x
    

    which cuts the computation in half. Unfortunately, this can occasionally get in the way of other optimizations. Another problem is that the two expressions have to be in the same place and that they have to be syntactically the same, not the same by value. For example, CSE won't fire in the following code without a bunch of inlining:

     x = (1 + (2 + 3)) + ((1 + 2) + 3)
     y = f x
     z = g (f x) y
    

    However, if you compile via llvm, you may get some of this combined, due to its Global Value Numbering pass.

  • Liberate case

    This seems to be a terribly documented transformation, besides the fact that it can cause code explosion. Here is a reformatted (and slightly rewritten) version of the little documentation I found:

    This module walks over Core, and looks for case on free variables. The criterion is: if there is a case on a free variable on the route to the recursive call, then the recursive call is replaced with an unfolding. For example, in

     f = \ t -> case v of V a b -> a : f t
    

    the inner f is replaced, to make

     f = \ t -> case v of V a b -> a : (letrec f = \ t -> case v of V a b -> a : f t in f) t
    

    Note the need for shadowing. Simplifying, we get

     f = \ t -> case v of V a b -> a : (letrec f = \ t -> a : f t in f t)
    

    This is better code, because a is free inside the inner letrec, rather than needing projection from v. Note that this deals with free variables, unlike SpecConstr, which deals with arguments that are of known form.

    See below for more information about SpecConstr.

  • SpecConstr - this transforms programs like

     f (Left x) y = somethingComplicated1
     f (Right x) y = somethingComplicated2
    

    into

     f_Left x y = somethingComplicated1
     f_Right x y = somethingComplicated2
    
     {-# INLINE f #-}
     f (Left x) = f_Left x
     f (Right x) = f_Right x
    

    As an extended example, take this definition of last:

     last [] = error "last: empty list"
     last (x:[]) = x
     last (x:x2:xs) = last (x2:xs)
    

    We first transform it to

     last_nil = error "last: empty list"
     last_cons x [] = x
     last_cons x (x2:xs) = last (x2:xs)
    
     {-# INLINE last #-}
     last [] = last_nil
     last (x : xs) = last_cons x xs
    

    Next, the simplifier runs, and we have

     last_nil = error "last: empty list"
     last_cons x [] = x
     last_cons x (x2:xs) = last_cons x2 xs
    
     {-# INLINE last #-}
     last [] = last_nil
     last (x : xs) = last_cons x xs
    

    Note that the program is now faster, as we are not repeatedly boxing and unboxing the front of the list. Also note that the inlining is crucial, as it allows the new, more efficient definitions to actually be used, as well as making recursive definitions better.

    SpecConstr is controlled by a number of heuristics. The ones mentioned in the paper are as such:

    1. The lambdas are explicit and the arity is a.
    2. The right hand side is "sufficiently small," something controlled by a flag.
    3. The function is recursive, and the specializable call is used in the right hand side.
    4. All of the arguments to the function are present.
    5. At least one of the arguments is a constructor application.
    6. That argument is case-analysed somewhere in the function.

    However, the heuristics have almost certainly changed. In fact, the paper mentions an alternative sixth heuristic:

    Specialise on an argument x only if x is only scrutinised by a case, and is not passed to an ordinary function, or returned as part of the result.

As this was a very small file (12 lines), it probably didn't trigger very many optimizations (though I think it did all of them). This also doesn't tell you why those passes were chosen and why they were put in that order.


Now we're getting somewhere! Comments: You seem to have a truncated sentence in the Specialise section. I don't see the point of float-out: what is it for? How does it decide whether to float in or float out (why doesn't it get into a loop)? I had the impression from somewhere that GHC doesn't do CSE at all, but apparently that's wrong. I feel like I'm getting lost in details without seeing the big picture... this topic is even more complicated than I thought. Maybe my question is impossible, and there's just no way to gain this intuition short of tons of experience or working on GHC yourself? - glaebhoerl
Well, I don't know, but I've never worked on GHC, so you must be able to gain some intuition. - gereeter
2
Also, regarding the big picture, my opinion is that there really isn't one. When I want to guess which optimizations will be performed, I go down the checklist. Then I do it again, to see how each pass will change things. And again. Essentially, I play the compiler. The only optimization scheme I know of that really has a "big picture" is supercompilation. - gereeter
1
What do you mean by "things have to be named correctly for fusion to work"? - Vincent Beffara
This is a brilliant answer. Dammit, now I want to go hack on a compiler! - MathematicalOrchid

67

Lazy evaluation

This isn't a "compiler optimization", but it's something guaranteed by the language specification, so you can always count on it happening. Essentially, it means that no work is performed until you "do something" with the result. (Unless you do one of several things to deliberately turn off laziness.) This, obviously, is an entire topic in its own right, and SO has lots of questions and answers about it already.

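A tiny illustration (my example, not from the answer): the infinite list below is never built in full; only enough of it is evaluated to produce the first even element.

firstEven :: Int
firstEven = head (filter even [1 ..])   -- forces only the elements 1 and 2

main :: IO ()
main = print firstEven
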
In my limited experience, making your code too lazy or too strict carries vastly larger performance penalties (in time and space) than any of the other stuff I'm about to talk about...

Strictness analysis

Lazy evaluation is about avoiding work unless it's necessary. If the compiler can determine that a given result will "always" be needed, it won't bother storing the computation and performing it later; it will just perform it directly, because that is more efficient. This is so-called "strictness analysis".

The gotcha, obviously, is that the compiler cannot always detect when something could be made strict. Sometimes you need to give the compiler little hints. (I'm not aware of any easy way to determine whether strictness analysis has done what you think it has, other than wading through the Core output.)

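One common way to supply such a hint is a bang pattern on an accumulator (a sketch of mine, not from the original answer):

{-# LANGUAGE BangPatterns #-}

-- The bang pattern forces 'acc' at each step, so no chain of thunks builds
-- up even if strictness analysis doesn't fire on its own.
sumInts :: [Int] -> Int
sumInts = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs
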
Inlining

If you call a function, and the compiler can tell which function you're calling, it may try to "inline" that function - that is, to replace the function call with a copy of the function itself. The overhead of a function call is usually pretty small, but inlining often enables other optimizations which wouldn't have happened otherwise, so inlining can be a big win.

Functions are only inlined if they are "small enough" (or if you add a pragma specifically asking for inlining). Also, functions can only be inlined if the compiler can tell which function you're calling. There are two main ways the compiler could be unable to tell:

  • If the function you're calling is passed in from somewhere else. For example, when the filter function is compiled, you can't inline the filter predicate, because it's a user-supplied argument.

  • If the function you're calling is a class method and the compiler doesn't know which type is involved. For example, when the sum function is compiled, the compiler can't inline the + function, because sum works with several different number types, each of which has a different + function.

In the latter case, you can use the {-# SPECIALIZE #-} pragma to generate versions of a function that are hard-coded to a particular type. For example, {-# SPECIALIZE sum :: [Int] -> Int #-} would compile a version of sum hard-coded for the Int type, which means that + can be inlined in this version.

Note that our new, special sum function will only be called when the compiler can tell that we're working with Int. Otherwise the original, polymorphic sum gets called. Again, the actual function-call overhead is fairly small. It's the additional optimizations that inlining enables which are beneficial.

Common subexpression elimination

If a certain block of code calculates the same value twice, the compiler may replace that with a single instance of the same computation. For example, if you do

(sum xs + 1) / (sum xs + 2)

then the compiler might optimize this to

let s = sum xs in (s+1)/(s+2)

You might expect that the compiler would always do this. However, apparently in some situations this can result in worse performance, not better, so GHC does not always do it. Frankly, I don't really understand the details behind this one. But the bottom line is: if this transformation is important to you, it's not hard to do it manually. (And if it's not important, why are you worrying about it?)

Case expressions

Consider the following:

foo (0:_ ) = "zero"
foo (1:_ ) = "one"
foo (_:xs) = foo xs
foo (  []) = "end"

The first three equations all check whether the list is non-empty (among other things). But checking the same thing three times is wasteful. Fortunately, it's very easy for the compiler to optimize this into several nested case expressions. In this case, something like

foo xs =
  case xs of
    y:ys ->
      case y of
        0 -> "zero"
        1 -> "one"
        _ -> foo ys
    []   -> "end"

This is rather less intuitive, but more efficient. Because the compiler can easily do this transformation, you don't have to worry about it. Just write your pattern matching in the most intuitive way possible; the compiler is very good at reordering and rearranging it to make it as fast as possible.

Fusion

The standard Haskell idiom for list processing is to chain together functions that take one list and produce a new list. The canonical example being

map g . map f

Unfortunately, while laziness guarantees skipping unnecessary work, all the allocation and deallocation of the intermediate lists saps performance. "Fusion" or "deforestation" is where the compiler tries to eliminate these intermediate steps.

The trouble is, most of these functions are recursive. Without the recursion, it would be an elementary exercise to squish all the functions together into one big code block, run the simplifier over it, and produce really optimal code with no intermediate lists. But because of the recursion, that won't work.

You can fix some of this with {-# RULE #-} pragmas. For example,

{-# RULES "map/map" forall f g xs. map f (map g xs) = map (f.g) xs #-}

Now every time GHC sees map applied to another map, it squishes it into a single pass over the list, eliminating the intermediate list.

Trouble is, this works only for map followed by map. There are many other possibilities - map followed by filter, filter followed by map, and so on. Rather than hand-coding a solution for each of them, so-called "stream fusion" was invented. This is a more complicated trick, which I won't describe here.

The long and short of it is: these are all special optimization tricks written by the programmer. GHC itself knows nothing about fusion; it's all in the list libraries and other container libraries. So which optimizations happen depends on how your container libraries are written (or, more realistically, on which libraries you choose to use).

For example, if you work with Haskell '98 arrays, don't expect any fusion of any kind. But I understand that the vector library has extensive fusion capabilities. It's all about the libraries; the compiler just provides the RULES pragma. (Which is extremely powerful, by the way. As a library author, you can use it to rewrite client code!)

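As a hedged illustration of such library-level fusion (my example, assuming the vector package): with -O, the rewrite rules shipped with Data.Vector.Unboxed typically collapse the following pipeline into a single accumulating loop, with no intermediate vector allocated.

import qualified Data.Vector.Unboxed as V

-- Sketch only: map and sum are fused by vector's RULES, so no intermediate
-- vector is ever built.
sumSquares :: Int -> Int
sumSquares n = V.sum (V.map (\x -> x * x) (V.enumFromTo 1 n))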

Meta:

  • I agree with the people saying "code first, profile second, optimize third".

  • I also agree with the people saying "it is useful to have a mental model of how much a given design decision costs".

Balance in all things, and all that...


9
Not quite. The language specification promises "non-strict semantics", but it doesn't guarantee that no superfluous work gets performed before you "do something" with the result. - Dan Burton
1
@DanBurton Sure. But that isn't easy to explain in a couple of sentences. Besides, since GHC is pretty much the only Haskell implementation in existence, the fact that GHC is lazy is good enough for most people. - MathematicalOrchid
5
Regarding CSE (common subexpression elimination): my impression is that it is done rather rarely, because it can introduce unwanted sharing and hence space leaks. - Joachim Breitner
2
Sorry for not replying earlier and not accepting your answer. While your answer is long and impressive, it doesn't cover the territory I wanted. What I'd like is a list of lower-level transformations such as lambda/let/case-floating, type/constructor/function argument specialization, strictness analysis and unboxing (which you mention), worker/wrapper, and whatever else GHC does, along with explanations and examples of their input and output code, ideally together with examples of their combined effect and of cases where a transformation does not happen. I guess I should put up a bounty? - glaebhoerl
@illissius That sounds like a problem too big to be tackled in a Stack Overflow answer. I admit it also sounds really interesting... - MathematicalOrchid

8

If a let binding v = rhs is used in only one place, you can count on the compiler to inline it, even if rhs is big.

The exception (which is hardly one in the context of the current question) is lambdas that risk duplicating work. Consider:

let v = rhs
    l = \x-> v + x
in map l [1..100]

Inlining v here would be dangerous, because the one (syntactic) use would translate into 99 extra evaluations of rhs. However, in this case you would be very unlikely to want to inline it manually either. So essentially you can use the rule:

If you'd consider inlining a name that only appears once, the compiler will do it anyway.

As a happy corollary, simply using a let binding to decompose a long statement (in the hope of gaining clarity) is essentially free.

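For instance (a sketch of mine, not from the paper below): both bindings here are used exactly once, so GHC inlines them and this compiles to the same code as writing the division out by hand.

average :: [Double] -> Double
average xs = let total = sum xs
                 count = fromIntegral (length xs)
             in total / count
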
This comes from community.haskell.org/~simonmar/papers/inline.pdf, which includes a lot more information about inlining.

