Stack allocation feature (performance)

Question


While investigating a small performance issue, I noticed an interesting stack allocation behavior. Here is the template I used to measure time:

#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; //for simple optimization suppression
void foo();

int main()
{   
    const size_t n = 10000000; //ten millions
    auto start = high_resolution_clock::now();

    for (size_t i = 0; i < n; i++)
    {
        foo();
    }

    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}

Now the point is the implementation of foo(). Each implementation allocates 500000 ints in total:

Allocated in one chunk:

void foo()
{
    const int size = 500000;
    int a1[size];

    x = a1[size - 1];
}

Result: 7.3 seconds;

Allocated in two chunks:

void foo()
{
    const int size = 250000;
    int a1[size];
    int a2[size];

    x = a1[size - 1] + a2[size - 1];
}

Result: 3.5 seconds;

Allocated in four chunks:

void foo()
{
    const int size = 125000;
    int a1[size];
    int a2[size];
    int a3[size];
    int a4[size];

    x = a1[size - 1] + a2[size - 1] +
        a3[size - 1] + a4[size - 1];
}

Result: 1.8 seconds.

And so on... I split it into 16 chunks and got a result time of 0.38 seconds.

Please explain to me why and how this happens.
I am using MSVC 2013 (v120), Release build.

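Update:
My machine is an x64 platform, and I compiled for the Win32 platform.
When I compile for the x64 platform, all cases take about 40 ms.
Why does the platform choice have such a large effect?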

- MrPisarik

What is your machine configuration? What compiler version and compile flags are you using? - WhiZTiM

@WhiZTiM, I was trying to avoid optimization :) Can you suggest some improvements to reliably avoid compiler optimizations and cache misses? - MrPisarik


Please don't post code with non-standard void main. FTFY. - Cheers and hth. - Alf



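Compiler writers love undefined behavior because it gives them many ways to make code faster. You are reading values that were never written. The optimizer notices this, knows that any value is good enough, and so just reads a1[0]. That in turn allows it to eliminate a2, a3 and a4, which makes the stack frame smaller, which makes it spend less time in _chkstk. So the execution time is proportional to the value of "size". - Hans Passant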


2 Answers


You should look at the generated assembly to see what your compiler actually does with the code. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.

Because of the undefined behavior, clang optimizes everything away. The result is (foo is the first version, foo2 the second):

foo:                                    # @foo
        retq

foo2:                                   # @foo2
        retq

icc treats both versions very similarly:

foo:
        pushq     %rbp                                          #4.1
        movq      %rsp, %rbp                                    #4.1
        subq      $2000000, %rsp                                #4.1
        movl      -4(%rbp), %eax                                #8.9
        movl      %eax, x(%rip)                                 #8.5
        leave                                                   #10.1
        ret                                                     #10.1

foo2:
        pushq     %rbp                                          #13.1
        movq      %rsp, %rbp                                    #13.1
        subq      $2000000, %rsp                                #13.1
        movl      -1000004(%rbp), %eax                          #18.9
        addl      -4(%rbp), %eax                                #18.24
        movl      %eax, x(%rip)                                 #18.5
        leave                                                   #19.1
        ret

gcc generates different assembly for the different versions. Version 6.1 produces code that would show behavior similar to your experiments:

foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx  #only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret

So the only way to understand the difference is to look at the assembly generated by your compiler. Everything else is just guessing.

- ead


- 1201ProgramAlarm · Accepted Answer

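Looking at the disassembly from VS2015 Update 3, in the 2-array and 4-array versions of foo the compiler optimizes out the unused arrays and reserves stack space for only 1 array in each function. Since the later functions have smaller arrays, they take less time. The assignment to x reads the same memory location for both (or all 4) arrays. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, 2 or 4 distinct arrays are read.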
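To compare chunk counts without relying on undefined behavior, each array can be given a defined value before it is read. The following is only a minimal sketch, not the asker's code: the helper name `touch_two_chunks` and its `Size` template parameter are hypothetical, and `volatile` is used here on the assumption that forcing the stores and loads to be emitted keeps both arrays in the frame in practice.

```cpp
#include <cstddef>

// Hypothetical helper (not from the question): keeps two separate local
// arrays alive by writing to and then reading from each one through
// volatile accesses. Size must be greater than zero.
template <std::size_t Size>
int touch_two_chunks()
{
    volatile int a1[Size];
    volatile int a2[Size];

    // Write first, so the reads below are well-defined.
    a1[Size - 1] = 1;
    a2[Size - 1] = 2;

    // Volatile reads have to be performed, so the optimizer cannot
    // replace them with an arbitrary value or fold the arrays together.
    return a1[Size - 1] + a2[Size - 1];
}
```

Calling `touch_two_chunks<250000>()` from the timing loop would then measure a frame that really holds two arrays, provided the thread's stack is large enough for it. Checking the generated assembly is still the only way to confirm what a given compiler does.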

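The long time taken by these functions is due to the stack probes performed by __chkstk as part of stack overflow detection (necessary when the compiler needs more than 1 page of space to hold all the local variables).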
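A rough way to sanity-check this explanation is to count how many pages a probe would have to touch for each frame size. This is a back-of-the-envelope sketch under stated assumptions: a 4 KiB page size, a 4-byte int, and a frame that holds exactly one array (as described above). The names `kPageSize`, `one_array_frame_bytes` and `pages_probed` are made up for illustration and are not part of any compiler or runtime API.

```cpp
#include <cstddef>

// Assumption: 4 KiB pages, which is typical for x86 Windows.
const std::size_t kPageSize = 4096;

// Frame size in bytes when only one array of `ints_per_array` ints
// survives optimization (ignores saved registers and padding).
std::size_t one_array_frame_bytes(std::size_t ints_per_array)
{
    return ints_per_array * sizeof(int);
}

// Approximate number of pages a page-by-page stack probe would touch
// for a frame of `frame_bytes` bytes (rounded up to whole pages).
std::size_t pages_probed(std::size_t frame_bytes)
{
    return (frame_bytes + kPageSize - 1) / kPageSize;
}
```

Under these assumptions, arrays of 500000, 250000, 125000 and 31250 ints come out to 489, 245, 123 and 31 pages per call. That is roughly the same ratio as the measured 7.3, 3.5, 1.8 and 0.38 seconds, which fits the idea that the time is dominated by per-page probing.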