如何优化这个Langton蚂蚁模拟器？

Question

如何优化这个Langton蚂蚁模拟器？

3

我正在编写一个 Langton's ant 模拟器（使用规则字符串 RLR），并尝试优化其速度。以下是当前的相关代码：

#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
  uint_fast8_t* state;
  uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
  uint_fast8_t ant_orientation=0;
  uint_fast8_t two_pow_five=32;
  uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
  uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
  int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1};
  ... initialise empty state
  while(1)
  {
    while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
    {
      while(--two_pow_thirty_two)
      {
        /*one iteration*/
        /* 54 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
        state[ant] = (36 >> (state[ant] *2) ) & 3;
        ant+=move_ant[ant_orientation];
        */

        /* 47 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        state[ant] += (state[ant]==2)?-2:1;
        ant+=move_ant[ant_orientation];
        */

        /* 46 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        if(state[ant]==2)
        {
          --state[ant];
          --state[ant];
        }
        else
          ++state[ant];
        ant+=move_ant[ant_orientation];
        */

        /* 44 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
        if(state[ant]==3)state[ant]=0;
        ant+=move_ant[ant_orientation];
        */

        // 37 seconds for init + 2^32 steps
        // handle every situation with nested switches and constants
        switch(ant_orientation)
        {
          case 0:
            switch(state[ant])
            {
              case 0:
                ant_orientation=1;
                state[ant]=1;
                ++ant;
                break;
              case 1:
                ant_orientation=3;
                state[ant]=2;
                --ant;
                break;
              case 2:
                ant_orientation=1;
                state[ant]=0;
                ++ant;
                break;
            }
            break;
          case 1:
            switch(state[ant])
            {
              ...
            }
            break;
          case 2:
            switch(state[ant])
            {
              ...
            }
            break;
          case 3:
            switch(state[ant])
            {
              ...
            }
            break;
        }

      }
    }
    two_pow_five=32;
    ... dump to file every 2^37 steps
  }
  return 0;
}

我有两个问题：

我已经尝试了c语言的最佳优化，通过试错测试，是否还有我没有利用的技巧？请尽量使用c语言讲解，尽管我可能会在某些时候尝试汇编。
是否有更好的建模方式来增加速度？

更多信息：可移植性不重要。我在64位Linux上使用gcc，i5-2500k和16 GB内存。目前的状态数组使用了4GB，程序可能会使用12GB内存。sizeof(uint_fast8_t)=1。故意不进行边界检查，从转储中手动发现容易出现损坏。

编辑:也许相反地，堆积 switch 语句而不是消除它们到目前为止产生了最佳效率。

编辑:我重新建模了问题，并想出了比每次迭代一个步骤更快的方法。以前，每个状态元素使用两个位描述 Langton's ant 网格中的单个单元格。新方法使用所有 8 位，并描述网格的 2×2 部分。每次迭代都会执行不同数量的步骤，方法是查找当前状态+方向的预计算值的步数、新方向和新状态。假设一切都是同样可能的，平均每次迭代需要2步。额外的好处是它只使用1/4的内存来建模相同的区域：

while(--iteration)
{
        // roughly 31 seconds per 2^32 steps
        table_offset=(state[ant]*24)+(ant_orientation*3);
        it+=twoxtwo_table[table_offset+0];
        state[ant]=twoxtwo_table[table_offset+2];
        ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}

我还没有试过对其进行优化，下一步要尝试的是消除偏移方程和查找，使用嵌套switch和类似以前的常数（但内部有648个情况而不是12个）。

- John Smith

你的双重大小写优化似乎是极致的 :D - huseyin tugrul buyukisik

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- huseyin tugrul buyukisik · Accepted Answer

或者，您可以使用一个单一的无符号字节常量作为人工寄存器，而不是分支：

value:   1  3  1  1
bits:   01 11 01 01 ---->101 decimal value for an unsigned byte
index    3  2  1  0  ---> get first 2 bits to get  "1"  (no shift)
                      --> get second 2 bits to get "1"  (shifting for 2 times)
                      --> get third 2 bits to get  "3"  (shifting for 4 times)
                      --> get last 2 bits to get   "1"  (shifting for 6 times)

 Then "AND" the result with  binary(11) or decimal(3) to get your value.

 (101>>( (++state[ant])*2  ) ) & 3 would give you the turnright or turnleft

 Example:
 ++state[ant]= 0:  ( 101>>( (0)*2 ) )&3  --> 101 & 3 = 1
 ++state[ant]= 1:  ( 101>>( (1)*2 ) )&3  --> 101>>2 & 3 = 1
 ++state[ant]= 2:  ( 101>>( (2)*2 ) )&3  --> 101>>4 & 3 = 3  -->turn left
 ++state[ant]= 3:  ( 101>>( (3)*2 ) )&3  --> 101>>6 & 3 = 1

 Maximum six-shifting + one-multiplication + one-"and" may be better.
 Dont forget constant can be auto-promoted so you may add some suffixes or something else.

由于您正在使用“unsigned int”进行%4模数运算，因此可以使用“and”操作。

  state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;

对于没有技术经验的编译器来说，这应该可以提高速度。

最困难的部分：模3

  C = A % B is equivalent to C = A – B * (A / B)
  We need  state[ant]%3
  Result = state[ant] - 3 * (state[ant]/3)

  state[ant]/3 is always <=1 for your valid direction states.
  Only when state[ant]  is 3 then state[ant]/3 is 1, other values give 0.
  When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
  Result = state[ant] - (0 or 3)

  Lets look at all possibilities:

  state[ant]=0:  0 - 0  ---> 0   ----> 00100100 shifted by 0 times &3 --> 00000000
  state[ant]=1:  1 - 0  ---> 1   ----> 00100100 shifted by 2 times &3 --> 00000001
  state[ant]=2:  2 - 0  ---> 2   ----> 00100100 shifted by 4 times &3 --> 00000010
  state[ant]=3:  3 - 3  ---> 0   ----> 00100100 shifted by 6 times &3 --> 00000000

  00100100 is 36 in decimal.

  (36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)

  Example: 

  state[ant]=0: 36 >> 0  --> 36 ----> 36& 3 ----> 0  satisfies 0%3
  state[ant]=1: 36 >> 2  --> 9 -----> 9 & 3 ----> 1  satisfies 1%3
  state[ant]=2: 36 >> 4  --> 2 -----> 2 & 3 ----> 2 satisfies  2%3
  state[ant]=3: 36 >> 6  --> 0 -----> 0 & 3 ----> 0 satisfies 3%3