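Most of our CPU cycles are spent on operations involving small matrices, so I wonder whether it is possible to optimize for that case. Consider the following code: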
module Main where
import Numeric.LinearAlgebra.HMatrix
import Criterion.Main
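-- 2x2 matrix stored row-major as (a b / c d); every field is strict and unpacked.
data Matrix2x2 = Matrix2x2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double
                           {-# UNPACK #-} !Double {-# UNPACK #-} !Double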
mul2x2p :: Matrix2x2 -> Matrix2x2 -> Matrix2x2
mul2x2p (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) =
    Matrix2x2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2)
inv2x2 :: Matrix2x2 -> Matrix2x2
inv2x2 (Matrix2x2 a b c d) =
    -- det is the determinant itself (not its inverse); a singular matrix (det == 0) is not checked for
    let det = a * d - b * c
    in  Matrix2x2 (d / det) (-b / det) (-c / det) (a / det)
add2x2 :: Matrix2x2 -> Matrix2x2 -> Matrix2x2
add2x2 (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) =
    Matrix2x2 (a1+a2) (b1+b2) (c1+c2) (d1+d2)
hm1 = matrix 2 [1, 2, 3, 4]
hm2 = matrix 2 [5, 6, 7, 8]
pm1 = Matrix2x2 1 2 3 4
pm2 = Matrix2x2 5 6 7 8
main = defaultMain
    [ bgroup "matrix tests"
        [ bench "pure mult"    $ whnf (mul2x2p pm1) pm2
        , bench "hmatrix mult" $ whnf (hm1 <>) hm2
        , bench "pure add"     $ whnf (add2x2 pm1) pm2
        , bench "hmatrix add"  $ whnf (hm1 +) hm2
        , bench "pure inv"     $ whnf inv2x2 pm1
        , bench "hmatrix inv"  $ whnf inv hm1
        ]
    ]
The benchmark results are as follows:
benchmarking matrix tests/pure mult
time 6.461 ns (6.368 ns .. 6.553 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 6.482 ns (6.394 ns .. 6.594 ns)
std dev 345.1 ps (271.4 ps .. 477.3 ps)
variance introduced by outliers: 77% (severely inflated)
benchmarking matrix tests/hmatrix mult
time 180.6 ns (178.2 ns .. 183.1 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 183.0 ns (180.6 ns .. 186.3 ns)
std dev 9.363 ns (7.405 ns .. 12.73 ns)
variance introduced by outliers: 71% (severely inflated)
benchmarking matrix tests/pure add
time 6.262 ns (6.223 ns .. 6.297 ns)
0.999 R² (0.999 R² .. 1.000 R²)
mean 6.281 ns (6.220 ns .. 6.355 ns)
std dev 235.0 ps (183.3 ps .. 321.0 ps)
variance introduced by outliers: 62% (severely inflated)
benchmarking matrix tests/hmatrix add
time 116.4 ns (115.0 ns .. 117.9 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 116.3 ns (115.2 ns .. 117.7 ns)
std dev 4.176 ns (3.447 ns .. 5.150 ns)
variance introduced by outliers: 55% (severely inflated)
benchmarking matrix tests/pure inv
time 7.811 ns (7.718 ns .. 7.931 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 7.895 ns (7.808 ns .. 7.988 ns)
std dev 296.4 ps (247.2 ps .. 358.3 ps)
variance introduced by outliers: 62% (severely inflated)
benchmarking matrix tests/hmatrix inv
time 908.5 ns (901.3 ns .. 916.6 ns)
0.999 R² (0.998 R² .. 0.999 R²)
mean 934.0 ns (917.6 ns .. 961.3 ns)
std dev 73.92 ns (50.53 ns .. 108.6 ns)
variance introduced by outliers: 84% (severely inflated)
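Before reading much into these timings, it may be worth confirming that the hand-written routines compute the right thing. The following is a minimal sketch of such a check. It restates the definitions from the benchmark program, with `deriving (Eq, Show)` and the helper `identity2x2` added purely for illustration (neither is in the original code).

```haskell
-- Same layout as in the benchmark program; Eq and Show are added
-- only so that results can be compared and printed.
data Matrix2x2 = Matrix2x2 !Double !Double !Double !Double
    deriving (Eq, Show)

-- Row-by-column product of two 2x2 matrices.
mul2x2p :: Matrix2x2 -> Matrix2x2 -> Matrix2x2
mul2x2p (Matrix2x2 a1 b1 c1 d1) (Matrix2x2 a2 b2 c2 d2) =
    Matrix2x2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2)

-- Inverse via the adjugate divided by the determinant.
-- No check is made for a singular matrix.
inv2x2 :: Matrix2x2 -> Matrix2x2
inv2x2 (Matrix2x2 a b c d) =
    let det = a * d - b * c
    in  Matrix2x2 (d / det) (-b / det) (-c / det) (a / det)

-- Hypothetical helper, not part of the original program.
identity2x2 :: Matrix2x2
identity2x2 = Matrix2x2 1 0 0 1

-- True when m multiplied by its computed inverse is exactly the identity.
-- Exact equality is only reasonable for inputs such as pm1, whose
-- intermediate values are all exactly representable; in general a
-- tolerance would be needed.
inverseRoundTrips :: Matrix2x2 -> Bool
inverseRoundTrips m = mul2x2p m (inv2x2 m) == identity2x2
```

For `pm1 = Matrix2x2 1 2 3 4` the determinant is -2, so every intermediate value is exactly representable and `inverseRoundTrips pm1` should hold with exact equality.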
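My questions are:
1) Is the speedup real, or is it an artifact of the benchmarking process?
2) If the speedup is real, is there an existing library that handles 1x1, 2x2, 3x3 and 4x4 matrices as special cases?
3) If not, what is the best way to wrap HMatrix so that small matrices take a fast path? GHC can only unpack records that have a single constructor. Is there a way to automatically generate different versions of our code, etc.?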
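To make the third question concrete, here is a rough sketch of the kind of wrapper I have in mind, not a finished design. All names (`SmallOrGeneral`, `M2`, `General`, `fromRows`, `toRows`, `mulGeneral`, `mul`) are hypothetical, and the `General` case uses a plain list of rows as a stand-in for an hmatrix `Matrix Double` so that the example is self-contained.

```haskell
import Data.List (transpose)

-- Hypothetical wrapper: a specialised 2x2 constructor plus a general fallback.
data SmallOrGeneral
    = M2 {-# UNPACK #-} !Double {-# UNPACK #-} !Double
         {-# UNPACK #-} !Double {-# UNPACK #-} !Double
    | General [[Double]]   -- assumption: stands in for an hmatrix Matrix Double
    deriving (Eq, Show)

-- Smart constructor: picks the specialised representation when the shape is 2x2.
fromRows :: [[Double]] -> SmallOrGeneral
fromRows [[a, b], [c, d]] = M2 a b c d
fromRows rows             = General rows

-- Convert either representation back to a list of rows.
toRows :: SmallOrGeneral -> [[Double]]
toRows (M2 a b c d) = [[a, b], [c, d]]
toRows (General rs) = rs

-- Naive general product; in real code this would delegate to hmatrix.
-- Dimensions are not checked.
mulGeneral :: [[Double]] -> [[Double]] -> [[Double]]
mulGeneral xs ys = [[sum (zipWith (*) r c) | c <- transpose ys] | r <- xs]

-- Fast path when both operands are 2x2, general path otherwise.
mul :: SmallOrGeneral -> SmallOrGeneral -> SmallOrGeneral
mul (M2 a1 b1 c1 d1) (M2 a2 b2 c2 d2) =
    M2 (a1*a2 + b1*c2) (a1*b2 + b1*d2) (c1*a2 + d1*c2) (c1*b2 + d1*d2)
mul x y = fromRows (mulGeneral (toRows x) (toRows y))
```

As far as I understand, the `Double` fields of `M2` can still be unpacked here; the single-constructor restriction bites when a `SmallOrGeneral` value is itself a strict field of another record. Every call to `mul` also pays for a pattern match on the constructor, which is part of what I am unsure about.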
example-test.cabal:
name: example-test
version: 0.1.0.0
build-type: Simple
cabal-version: >=1.10
executable example-test
  main-is:
    Main.hs
  build-depends:
    base >=4.7 && <4.8,
    criterion,
    hmatrix
  default-language:
    Haskell2010
  ghc-options:
    -H12G -O3 -optc-O3 -fllvm -rtsopts -threaded -fexcess-precision -j6 +RTS -N6 -RTS -fno-ignore-asserts -fcontext-stack=150
    -- -fforce-recomp