为什么在x86-64 GCC上，__int128_t比long long更快？

Question

为什么在x86-64 GCC上，__int128_t比long long更快？

c++performancex86-64cpu-architectureinteger-division

45

这是我的测试代码：

#include <chrono>
#include <iostream>
#include <cstdlib>
using namespace std;

using ll = long long;

int main()
{
    __int128_t a, b;
    ll x, y;

    a = rand() + 10000000;
    b = rand() % 50000;
    auto t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        a += b;
        a /= b;
        b *= a;
        b -= a;
        a %= b;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)a % 100000 << '\n';

    x = rand() + 10000000;
    y = rand() % 50000;
    t0 = chrono::steady_clock::now();
    for (int i = 0; i < 100000000; i++)
    {
        x += y;
        x /= y;
        y *= x;
        y -= x;
        x %= y;
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::steady_clock::now() - t0).count() << ' '
         << (ll)x % 100000 << '\n';

    return 0;
}

这是测试结果：

$ g++ main.cpp -o main -O2
$ ./main
2432 1
2627 1

在x64 GNU / Linux上使用GCC 10.1.0进行编译，无论是使用-O2优化还是未经优化，__int128_t始终比long long稍快一些。

int和double都比long long快得多; long long已成为最慢的类型。

为什么会这样呢？

- xxhxx

2

我认为这与“long long”无关。如果您将“x”和“y”定义为“__int128_t”，您也会得到这样的差异 https://godbolt.org/z/1e1YeE - Alex Lop.

乱序执行会对结果产生多大影响？乍一看，这两个测试看起来完全独立，那么处理器是否可以自由地乱序执行它们？我提出这个问题是为了测试我对这个主题的理解是否太过幼稚。 - Rich

1

@Rich OOO不会同时执行两个循环，可能是由于循环代码内部的依赖关系，导致OOO在这里效率不高。 - Alex Lop.

2

@Rich：硬件OoO执行仅适用于短距离，其中“短”最多在Skylake上大约为224条指令（ROB大小：https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/）。而且这是沿着执行路径测量的，每次通过循环运行循环体。请参见[我的答案](//softwareengineering.stackexchange.com/a/350024)。将两个循环合并只有在像Transmeta Crusoe这样内部进行动态重新编译的非传统CPU才可能实现，而不是当前按执行顺序查看指令的CPU。 - Peter Cordes

2

但是，这个糟糕的基准测试没有进行任何预热，所以唯一能使它免受 CPU 频率和其他预热效应影响的是它运行了大量迭代，因此这只是九牛一毛。性能评估的惯用方式？此外，它通过执行与其他操作同样多的除法来强调除法性能。对于大多数用例来说非常不现实。 - Peter Cordes

显示剩余8条评论

2个回答

31

TL:DR: __int128 division helper functions internally end up doing an unsigned div reg64 (after some branching on the values being positive and the upper halves being 0). 64-bit div is faster on Intel CPUs than the signed idiv reg64 that GCC inlines for signed long long. Faster by enough to make up for all the extra overhead of the helper function, and extended-precision for the other operations.

You probably wouldn't see this effect on AMD CPUs: long long would be faster as expected because idiv r64 is similar enough in perf to div r64 there.

And unsigned long long is faster than unsigned __int128 even on Intel CPUs, for example on my i7-6700k (Skylake) at 3.9GHz (run under perf stat to be sure of CPU frequency during the test):

- 2097 (i128) vs. 2332 (i64) - your original test (run back-to-back for CPU freq warm-up) - 2075 (u128) vs. 1900 (u64) - unsigned versions. Slightly less branching in u128 division vs. i128, but major difference for i64 vs. u64 where the only difference is div vs. idiv.

Also, drawing any general conclusions from a very specific micro-benchmark like this would be a bad idea. It is interesting to dig into why exactly the extended-precision __int128 type manages to be faster in this division benchmark with positive numbers small enough to fit in a 32-bit integer, though.

你的基准测试过于偏向除法运算，每次迭代都执行两次除法运算（/和%），即使它比其他操作昂贵得多，在大多数代码中使用频率也要低得多。（例如：对整个数组求和，然后再除以数组长度得到平均值。）

你的基准测试也没有指令级并行处理：每个步骤都依赖于前一个步骤的数据。这会阻止自动矢量化或显示较窄类型的一些优点。

（还要注意避免预热效应，例如第一个计时区域直到CPU达到最大睿频才变慢。性能评估的惯用方法是什么？但这比你计时区域的几秒钟发生得快得多，所以这里不是问题。）

128位整数除法（尤其是有符号的）太复杂了，GCC不想内联，所以GCC会调用一个帮助函数__divti3或__modti3。（TI = tetra-integer，GCC对4倍于int大小的整数的内部名称。）这些函数在GCC-internals manual中有文档说明。

你可以在the Godbolt compiler-explorer上看到编译器生成的汇编代码。即128位加法使用add/adc，乘法使用一个mul完整乘以低半部分，和两个非扩展的imul交叉乘积。是的，这些比int64_t的单指令等效性要慢。

但是Godbolt不会显示libgcc辅助函数的汇编代码。它甚至不在“编译为二进制文件”和反汇编模式下（而不是通常的编译器汇编文本输出）中对其进行反汇编，因为它动态链接了libgcc_s而不是libgcc.a。 扩展精度有符号除法通过必要时取反并对64位块进行无符号除法来完成，然后如有必要修正结果的符号。 如果两个输入都很小且为正，则不需要实际取反（只需测试和分支）。还有快速路径可用于小数值（高半部分除数= 0，并且商将适合64位），这在这种情况下是成立的。最终执行__divti3的路径如下：

这是我在Arch GNU/Linux系统上使用gcc-libs 10.1.0-2编译，并使用“g++ -g -O3 int128-bench.cpp -o int128-bench.O3”命令手动单步调用到__divti3的结果。

# Inputs: dividend = RSI:RDI, divisor = RCX:RDX
# returns signed quotient RDX:RAX
|  >0x7ffff7c4fd40 <__divti3>       endbr64             # in case caller was using CFE (control-flow enforcement), apparently this instruction has to pollute all library functions now.  I assume it's cheap at least in the no-CFE case.
│   0x7ffff7c4fd44 <__divti3+4>     push   r12
│   0x7ffff7c4fd46 <__divti3+6>     mov    r11,rdi
│   0x7ffff7c4fd49 <__divti3+9>     mov    rax,rdx                                                                                                       │   0x7ffff7c4fd4c <__divti3+12>    xor    edi,edi
│   0x7ffff7c4fd4e <__divti3+14>    push   rbx
│   0x7ffff7c4fd4f <__divti3+15>    mov    rdx,rcx
│   0x7ffff7c4fd52 <__divti3+18>    test   rsi,rsi      # check sign bit of dividend (and jump over a negation)
│   0x7ffff7c4fd55 <__divti3+21>    jns    0x7ffff7c4fd6e <__divti3+46>
... taken branch to
|  >0x7ffff7c4fd6e <__divti3+46>    mov    r10,rdx
│   0x7ffff7c4fd71 <__divti3+49>    test   rdx,rdx      # check sign bit of divisor (and jump over a negation), note there was a mov rdx,rcx earlier
│   0x7ffff7c4fd74 <__divti3+52>    jns    0x7ffff7c4fd86 <__divti3+70>
... taken branch to
│  >0x7ffff7c4fd86 <__divti3+70>    mov    r9,rax
│   0x7ffff7c4fd89 <__divti3+73>    mov    r8,r11
│   0x7ffff7c4fd8c <__divti3+76>    test   r10,r10      # check high half of abs(divisor) for being non-zero
│   0x7ffff7c4fd8f <__divti3+79>    jne    0x7ffff7c4fdb0 <__divti3+112>  # falls through: small-number fast path
│   0x7ffff7c4fd91 <__divti3+81>    cmp    rax,rsi      # check that quotient will fit in 64 bits so 128b/64b single div won't fault: jump if (divisor <= high half of dividend)
│   0x7ffff7c4fd94 <__divti3+84>    jbe    0x7ffff7c4fe00 <__divti3+192>  # falls through: small-number fast path
│   0x7ffff7c4fd96 <__divti3+86>    mov    rdx,rsi
│   0x7ffff7c4fd99 <__divti3+89>    mov    rax,r11
│   0x7ffff7c4fd9c <__divti3+92>    xor    esi,esi
│  >0x7ffff7c4fd9e <__divti3+94>    div    r9                #### Do the actual division ###
│   0x7ffff7c4fda1 <__divti3+97>    mov    rcx,rax
│   0x7ffff7c4fda4 <__divti3+100>   jmp    0x7ffff7c4fdb9 <__divti3+121>
...taken branch to
│  >0x7ffff7c4fdb9 <__divti3+121>   mov    rax,rcx
│   0x7ffff7c4fdbc <__divti3+124>   mov    rdx,rsi
│   0x7ffff7c4fdbf <__divti3+127>   test   rdi,rdi     # check if the result should be negative
│   0x7ffff7c4fdc2 <__divti3+130>   je     0x7ffff7c4fdce <__divti3+142>
... taken branch over a neg rax / adc rax,0 / neg rdx
│  >0x7ffff7c4fdce <__divti3+142>   pop    rbx
│   0x7ffff7c4fdcf <__divti3+143>   pop    r12
│   0x7ffff7c4fdd1 <__divti3+145>   ret
... return back to the loop body that called it

自IvyBridge开始，Intel CPU具有零延迟的mov指令，因此所有开销不会显著恶化关键路径延迟（即瓶颈）。或者至少不足以弥补idiv和div之间的差异。

分支由分支预测和推测执行处理，在实际输入寄存器值相同时才检查预测。分支每次都走同一条路，因此分支预测很容易学习。由于除法非常缓慢，乱序执行有足够的时间赶上。

在Intel CPU上，64位操作数大小的整数除法非常缓慢，即使数字实际上很小并且适合32位整数，带符号整数除法的额外微码也更加昂贵。

例如，在我的Skylake（i7-6700k）上，https://uops.info/ 显示（表搜索结果）

idiv r64 的前端为56个uop，延迟为41到95个周期（从除数到商，这是我认为相关的情况）。
div r64 的前端为33个uop，延迟为35到87个周期。（对于同一延迟路径）。

最佳延迟情况发生在小商或小被除数等情况下，我总是记不住哪个。

类似于GCC在128位除法方面在64位方面进行的分支，我认为CPU微码在狭窄操作方面进行了64位除法，可能是仅有10个uop的32位除法（带符号或无符号），延迟要低得多。（Ice Lake改进了分频器，因此64位除法与32位除法的速度差别不大。）

这就是为什么您发现此基准测试中long long比int慢得多的原因。在许多情况下，它大致相同，如果涉及内存带宽或SIMD，则速度为一半。（每个128位矢量宽度只有2个元素，而不是4个）。 AMD CPU更有效地处理64位操作数大小，性能仅取决于实际值，因此对于具有相同数字的div r32和div r64大致相同。 顺便提一下，实际值往往是类似于a=1814246614 / b=1814246613 = 1，然后a=1 % b=1814246612（每次迭代b减1）。只测试商为1的除法似乎非常愚蠢。（第一次迭代可能不同，但我们在第二次及以后进入这种状态。）

现代CPU上整数运算的性能除了除法之外并不依赖数据。（当然，如果存在编译时常量允许发出不同的汇编代码，则除以常量时使用计算时求得的乘法逆元要便宜得多。）

关于double：请参阅浮点数除法与浮点数乘法以进行除法和乘法的比较。FP除法通常更难避免，并且其性能在更多情况下都很重要，因此它被处理得更好。