Optimizing a Fortran subroutine

Question

3

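I have written a minimal Fortran implementation of the fast xoroshiro128plus pseudo-random number generator to replace the intrinsic random_number. The implementation is quite fast (4x faster than random_number) and its quality is good enough for my purposes; I don't use it in cryptographic applications.

My question is how to optimize this subroutine to squeeze the last drop of performance out of the compiler; even a 10% improvement would be appreciated. The subroutine will be called in tight loops in long simulations. I am more interested in generating a single random number at a time than in generating large vectors or nD arrays at once.

Here is a test program to give you some context on how my subroutine is used: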

program test_xoroshiro128plus
   implicit none
   integer, parameter :: n = 10000
   real*8  :: A(n,n)
   integer :: i, j, t0, t1, count_rate, count_max

   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call drand128(A(i,j))
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time :", real(t1-t0)/count_rate
   print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)

 contains

   impure elemental subroutine drand128(r)
      real*8, intent(out) :: r
      integer*8 :: s0 = 113, s1 = 19937
      s1 = xor(s0,s1)
      s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
      s1 = ior(ishft(s1,36), ishft(s1,-28))
      r = ishft(s0+s1, -1) / 9223372036854775808.d0
   end 

end program

- AboAmmar

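@Steve -- The same sequence of numbers is generated each time you run the program (i.e. no randomness), and the same holds for random_number unless you call random_seed() (ifort only, for example). I can also seed my generator easily. My point is mainly about speed: I could use the PRNGs shipped with MKL, but they only pay off for long vectors and I need one random number at a time. Also, my drand128 matches the speed of MKL's vdrnguniform without the need to include large modules or set weird parameters. - AboAmmar

@HighPerformanceMark -- Yes, I used to generate a long temporary vector of random numbers and take one at a time, like temp(i), but that strategy proved not efficient enough in both run time and memory usage. I mean MKL's vdrnguniform, of course, which is the fastest method I have tried. - AboAmmar

@Steve, no pseudo-random number generator is truly random, and for my application the seed does not matter; sometimes reproducibility is more important. Also, look at the numbers on the linked page (http://xoshiro.di.unimi.it/): `xoroshiro128+` is not only one of the fastest PRNGs but also one of the highest quality. Fortran's intrinsic uses the Mersenne Twister algorithm, which is 4x slower than xoroshiro128+. - AboAmmar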

1

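@AboAmmar, I never said anything about true random numbers. I only corrected your statement that random_number always returns the same sequence if random_seed is not called. That is processor-dependent behavior. And no, Fortran's intrinsic does not use the Mersenne Twister. Each Fortran vendor uses whatever algorithm it considers best. gfortran used MT a long time ago, but it was removed because of its poor quality. gfortran then used 4 independent KISS generators (KISS being Marsaglia's PRNG). Now gfortran uses one of Vigna's xorshift PRNGs. - Steve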

1

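FYI, Gfortran now uses a generator related to this one. "The runtime library implements the xorshift1024* random number generator (RNG). This generator has a period of 2^{1024}-1, and when using multiple threads up to 2^{512} threads can each generate 2^{512} random numbers before any aliasing occurs." https://gcc.gnu.org/onlinedocs/gfortran/RANDOM_005fNUMBER.html - Vladimir F Героям слава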

2 Answers

1

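OK, here is my attempt. First, I turned it into a function: under the x64 ABI (or similar), a function returns a floating-point value in a register, which is a lot faster than passing the result back through an argument. Second, I replaced the final division with a multiplication, although the Intel compiler might do that for you.

Timing, Intel i7 6820, WSL, Ubuntu 18.04: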

before -   0.850000024
after  -   0.601000011

GNU Fortran 7.3.0, command line

gfortran -std=gnu -O3 -ffast-math -mavx2 /mnt/c/Users/kkk/Documents/CPP/a.for

Code

  program test_xoroshiro128plus
  implicit none
  integer, parameter :: n = 10000
  real*8  :: A(n,n)
  integer :: i, j, t0, t1, count_rate, count_max

  call system_clock(t0, count_rate, count_max)
  do j = 1,n
     do i = 1,n
        A(i,j) = drand128()
     end do
  end do
  ! A = drand128()  ! not equivalent to the loop: with no array argument the function is called once and that one value fills A
  call system_clock(t1)

  print *, "Time :", real(t1-t0)/count_rate
  print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)

  contains

  impure elemental real*8 function drand128()
     real*8, parameter :: c = 1.0d0/9223372036854775808.d0
     integer*8 :: s0 = 113, s1 = 19937
     s1 = xor(s0,s1)
     s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
     s1 = ior(ishft(s1,36), ishft(s1,-28))
     drand128 = ishft(s0+s1, -1) * c
  end

  end program
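
The replacement of the division by a multiplication in drand128 can be checked in isolation. The sketch below is not part of the answer's code; the program name and the throwaway xorshift-style scramble used to visit many inputs are assumptions made for illustration. Because 2**63 is a power of two, its reciprocal is exactly representable in double precision, so both forms should round identically and the program is expected to report zero mismatches.

```fortran
program check_reciprocal
   use iso_fortran_env, only: int64, real64
   implicit none
   real(real64), parameter :: two63 = 9223372036854775808.0_real64
   real(real64), parameter :: c = 1.0_real64 / two63   ! exactly 2**(-63)
   integer(int64) :: x, y
   integer :: i, mismatches

   mismatches = 0
   x = 113_int64
   do i = 1, 1000000
      ! throwaway scramble (shifts and xors only, so no integer overflow)
      x = ieor(x, shiftl(x, 13))
      x = ieor(x, shiftr(x, 7))
      x = ieor(x, shiftl(x, 17))
      y = shiftr(x, 1)                     ! non-negative, like ishft(s0+s1,-1)
      ! compare the division form with the multiplication form
      if (real(y, real64) / two63 /= real(y, real64) * c) mismatches = mismatches + 1
   end do

   print *, "mismatches:", mismatches
   if (mismatches /= 0) error stop "division and multiplication disagree"
end program check_reciprocal
```

Since the two forms give the same value, a compiler may perform this substitution on its own even without -ffast-math; whether a given compiler and version actually does so is best confirmed by inspecting the generated assembly.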

- Severin Pappadeux

2

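I don't see any performance difference in my tests (see my answer). The two appear to perform exactly the same. Did you check the optimized assembly? Once the call is inlined, the optimization seems fairly easy. - Vladimir F Героям слава

@VladimirF No, I haven't checked the assembly yet; I will probably do it over the weekend. However, with the setup I described and the original test as given, I clearly see a difference in timing. - Severin Pappadeux

Be sure to set up the loop like I did, so that the CPU core realizes it needs to run at full speed. Only after that do speed comparisons make sense. - Vladimir F Героям слава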

- Vladimir F Героям слава · Accepted Answer

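I only now realized that you are asking about this particular PRNG. I use it in Fortran myself: https://bitbucket.org/LadaF/elmm/src/eb5b54b9a8eb6af158a38038f72d07865fe23ee3/src/rng_par_zig.f90?at=master&fileviewer=file-view-default

My code is slower than yours because it calls several subroutines and is meant to be more general. But let's try to condense the code I use into a single subroutine.

So, let's compare the performance of your code, the code optimized by @SeverinPappadeux, and my optimized code, using Gfortran 4.8.5.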

> gfortran -cpp -O3 -mtune=native xoroshiro.f90 

 Time drand128 sub:   1.80900002    
 Time drand128 fun:   1.80900002    
 Time rng_uni:   1.32900000

The code is below. Remember to let the CPU spin up: the timings from the first iteration of the k loop are just garbage!!!

program test_xoroshiro128plus
   use iso_fortran_env       
   implicit none
   integer, parameter :: n = 30000
   real*8  :: A(n,n)
   real*4  :: B(n,n)
   integer :: i, j, k, t0, t1, count_rate, count_max       

   integer(int64) :: s1 = int(Z'1DADBEEFBAADD0D0', int64), s2 = int(Z'5BADD0D0DEADBEEF', int64)

!let the CPU spin-up                                           
do k = 1, 3                                           
   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call drand128(A(i,j))
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time drand128 sub:", real(t1-t0)/count_rate

   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         A(i,j) = drand128_fun()
      end do
   end do
   ! call drand128(A)  ! works also with 2D 
   call system_clock(t1)

   print *, "Time drand128 fun:", real(t1-t0)/count_rate


   call system_clock(t0, count_rate, count_max)
   do j = 1,n
      do i = 1,n
         call rng_uni(A(i,j))
      end do
   end do
   call system_clock(t1)

   print *, "Time rng_uni:", real(t1-t0)/count_rate
end do

   print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)

 contains

   impure elemental subroutine drand128(r)
      real*8, intent(out) :: r
      integer*8 :: s0 = 113, s1 = 19937
      s1 = xor(s0,s1)
      s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
      s1 = ior(ishft(s1,36), ishft(s1,-28))
      r = ishft(s0+s1, -1) / 9223372036854775808.d0
   end 

   impure elemental real*8 function drand128_fun()
     real*8, parameter :: c = 1.0d0/9223372036854775808.d0
     integer*8 :: s0 = 113, s1 = 19937
     s1 = xor(s0,s1)
     s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
     s1 = ior(ishft(s1,36), ishft(s1,-28))
     drand128_fun = ishft(s0+s1, -1) * c
  end

  impure elemental subroutine rng_uni(fn_val)
    real(real64), intent(inout) ::  fn_val
    integer(int64) :: ival

    ival = s1 + s2

    s2 = ieor(s2, s1)
    s1 = ieor( ieor(rotl(s1, 24), s2), shiftl(s2, 16))
    s2 = rotl(s2, 37)    

    ival  = ior(int(Z'3FF0000000000000',int64), shiftr(ival, 12))
    fn_val = transfer(ival, 1.0_real64) - 1;    
  end subroutine

  function rotl(x, k)
    integer(int64) :: rotl
    integer(int64) :: x
    integer :: k

    rotl = ior( shiftl(x, k), shiftr(x, 64-k))
  end function    

end program

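The main difference should come from the faster and better way of converting the integer to a real. For details see: http://experilous.com/1/blog/post/perfect-fast-random-floating-point-numbers#half-open-range If you are bored, you can try inlining rotl() by hand, but I trust the compiler to take care of that.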
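
The integer-to-real conversion used in rng_uni can be examined on its own. The following is a minimal sketch, assuming IEEE 754 double precision; the helper name bits_to_unit and the test program are hypothetical and only isolate the two lines from rng_uni. The top 52 bits of the integer become the mantissa, the exponent field is set to 0x3FF so the bit pattern reads as a value in [1,2), and subtracting 1 maps it to [0,1).

```fortran
program check_bits_to_real
   use iso_fortran_env, only: int64, real64
   implicit none
   real(real64) :: lo, hi, div_top

   lo = bits_to_unit(0_int64)      ! all 64 bits clear
   hi = bits_to_unit(-1_int64)     ! all 64 bits set
   ! for contrast: the division form applied to the largest 63-bit input
   div_top = real(huge(1_int64), real64) / 9223372036854775808.0_real64

   print *, "bit trick, smallest:", lo
   print *, "bit trick, largest :", hi
   print *, "division,  largest :", div_top

   if (lo /= 0.0_real64) error stop "expected exactly 0"
   if (hi /= 1.0_real64 - epsilon(1.0_real64)) error stop "expected 1 - 2**(-52)"
   if (div_top /= 1.0_real64) error stop "expected rounding up to 1"

contains

   ! hypothetical helper: the same two steps as in rng_uni
   pure function bits_to_unit(ival) result(r)
      integer(int64), intent(in) :: ival
      real(real64) :: r
      integer(int64) :: bits
      ! top 52 bits -> mantissa, exponent field 0x3FF -> value in [1,2)
      bits = ior(int(Z'3FF0000000000000', int64), shiftr(ival, 12))
      r = transfer(bits, 1.0_real64) - 1.0_real64
   end function bits_to_unit

end program check_bits_to_real
```

The bit-pattern approach stays strictly below 1 by construction. The division form converts a 63-bit integer to a 53-bit mantissa first, so its largest inputs round up and it can return exactly 1.0, which matters if the calling code assumes a half-open range.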