GHC中的跨模块优化

Question

GHC中的跨模块优化

performancehaskelloptimizationhaskell-criterion

13

我有一个非递归函数，用于计算最长公共子序列，经过测试表现良好（使用ghc 7.6.1编译并使用-O2 -fllvm标志），如果在同一模块中使用Criterion进行测量。另一方面，如果我将该函数转换为模块，仅导出该函数（按照这里的建议），然后再次使用Criterion进行测量，则会出现约2倍的减速（如果我将 Criterion 测试移回定义函数的模块，则问题消失）。我尝试使用INLINE pragma标记该函数，但它对跨模块性能测量没有任何影响。

在我看来，GHC可能正在执行严格性分析，当函数和主要函数（从中可以调用该函数）位于同一模块中时，这种分析效果很好，但是当它们分开时则不行。我希望能指导如何将该功能模块化，以便在从其他模块调用时能够良好地运行。有关代码过长无法在此粘贴-如果您想尝试，请在此处查看。以下是我尝试做的小例子（包括代码片段）：

-- Function to find longest common subsequence given unboxed vectors a and b
-- It returns indices of LCS in a and b
lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

-- This section below measures performance of lcs function - if I move it to 
-- a different module, performance degrades ~2x - mean goes from ~1.25us to ~2.4us
-- on my test machine
{-- 
config :: Config
config = defaultConfig  { cfgSamples = ljust 100 }

a = U.fromList ['a'..'j'] :: Vector Char
b = U.fromList ['a'..'k'] :: Vector Char

suite :: [Benchmark]
suite = [
          bench "lcs 10" $ whnf (lcs a) b
        ]

main :: IO()
main = defaultMainWith config (return ()) suite
--}

- Sal

尝试使用INLINEABLE。它可能会更好地工作。 - Carl

@Carl，尝试了lcs函数。仍然一样。 - Sal

5

我猜测问题在于当所有内容都在一个模块中时，GHC可以将类型变量 a 专门化为 Char ，因为它从未与任何其他类型一起使用，这样可以消除类型类的开销。你可以尝试玩弄一下 SPECIALIZE 修饰符（或手动更改为 Char ），看看是否有任何效果。 - hammar

我自己也多次看到了相同的情况，每一次都会想：“应该有一个程序可以将多模块Haskell程序转换为单模块，并对两者进行基准测试。”这将使我们更容易知道是否应该考虑调整SPECIALIZE/INLINE/INLINEABLE。类似的程序存在吗？它本质上应该是名称混淆，对吗？ - gspr

我仍然希望有一个编译指示可以告诉 GHC 根据特定参数的值在编译时进行函数特化。这将极大地帮助高阶函数的高性能使用。 - Carl

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Daniel Fischer · Accepted Answer

hammar 是正确的，重要的问题是编译器可以在同时看到代码和使用lcs的类型时对代码进行特化优化，从而针对特定的类型生成代码。

如果编译器不知道代码将在哪种类型上使用，则只能生成多态代码。这对性能来说很糟糕 - 我很惊讶这里只有大约2倍的差异。多态代码意味着对于许多操作都需要进行类型类查找，至少这使得无法内联所查找的函数或常量折叠大小[例如用于非装箱数组/向量访问]。

如果需要将实现和使用分开到不同模块中，并且想要获得与单个模块相当的性能，则必须使需要特化的代码在使用点可见（或者，如果您在实现点处知道所需类型，则在那里特化，{-# SPECIALISE foo :: Char -> Int, foo :: Bool -> Integer #-}等）。

通常通过将函数标记为{-# INLINABLE #-}，在接口文件中公开展开来使代码在使用点可见。

我尝试使用INLINE pragma标记函数，但对跨模块性能测量没有任何影响。

只需标记即可。

lcs :: (U.Unbox a, Eq a) => Vector a -> Vector a -> (Vector Int,Vector Int)
lcs a b | (U.length a > U.length b) = lcsh b a True
        | otherwise = lcsh a b False

INLINE或INLINABLE并没有什么区别。当然，那个函数很简单，编译器会展开它，因为它非常小。即使它的展开没有被暴露，差异也是无法测量的。

你还需要暴露执行实际工作的函数的展开，至少是多态函数的展开，比如lcsh、findSnakes、gridWalk和cmp（cmp在这里非常关键，但其他函数也必须要被暴露以便于从中调用专门的cmp）。

将它们设为INLINABLE后，在分离模块的情况下，差异就会显现出来。

$ ./diffBench 
warming up
estimating clock resolution...
mean is 1.573571 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  2182 (0.7%) high severe
estimating cost of a clock call...
mean is 40.54233 ns (12 iterations)

benchmarking lcs 10
mean: 1.628523 us, lb 1.618721 us, ub 1.638985 us, ci 0.950
std dev: 51.75533 ns, lb 47.04237 ns, ub 58.45611 ns, ci 0.950
variance introduced by outliers: 26.787%
variance is moderately inflated by outliers

以及单个模块的情况

$ ./oneModule 
warming up
estimating clock resolution...
mean is 1.726459 us (320001 iterations)
found 2092 outliers among 319999 samples (0.7%)
  1608 (0.5%) high severe
estimating cost of a clock call...
mean is 39.98567 ns (14 iterations)

benchmarking lcs 10
mean: 1.523183 us, lb 1.514157 us, ub 1.533071 us, ci 0.950
std dev: 48.48541 ns, lb 44.43230 ns, ub 55.04251 ns, ci 0.950
variance introduced by outliers: 26.791%
variance is moderately inflated by outliers

相对较小。