Why are elementwise additions much faster in separate loops than in a combined loop?

2434
Suppose that a1, b1, c1 and d1 point to heap memory, and my numerical code has the following core loop.
const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}

for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}

Compiled with Microsoft Visual C++ 10.0 with full optimization and SSE2 enabled for 32 bits, on an Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds.

The disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double-loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

  • Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

  • It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

Here is the full code. It uses TBB's tick_count for higher-resolution timing, which can be disabled by not defining the TBB_TIMING macro:

#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>

//#define TBB_TIMING

#ifdef TBB_TIMING   
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;


void allo(int cont, int n)
{
    switch(cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch(cont){
      case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont,n);
#endif

#ifdef TBB_TIMING   
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif
        
    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++){
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }
    double ret;

#ifdef TBB_TIMING   
    tick_count t1 = tick_count::now();
    ret = 2.0*double(n)*double(m)/(t1-t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0*double(n)*double(m)/(double)(end - start) *double(CLOCKS_PER_SEC);
#endif
    
#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}


void main()
{   
    freopen("C:\\test.csv", "w", stdout);

    char *s = " ";

    string na[2] ={"new_cont", "new_sep"};

    cout << "n";

    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif
            
    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif
    
    for (long long n = 1L; n < nmax; n = max(n+1, (long long)(n*1.2)))
    {
        const long long m = 10000000/n;
        cout << n;

        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }
}

It shows FLOPS for different values of n:

Performance chart


4
Maybe the OS has to search physical memory on every access and has something like a cache for accessing the same memory block again, which slows things down. - AlexTheo
9
Are you compiling with optimizations on? That looks like a lot of assembly code for O2... - Luchian Grigore
74
Just to nitpick: these two code snippets are not equivalent because of potentially overlapping pointers. C99 has the restrict keyword for such situations. I don't know if MSVC has something similar. Of course, if this were the issue, then the SSE code would not be correct. - user510306
9
This may have to do with memory aliasing. With one loop, d1[j] may alias with a1[j], so the compiler may back off from doing certain memory optimizations. That doesn't happen if you separate the writes to memory into two loops. - rturrado
1
I'd bet that the L1D, L1D_CACHE_ST, L2_RQSTS and L2_DATA_RQSTS performance counters would be instructive. See Intel Core i7 (Nehalem) events. - jww
11 Answers

1788
Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four pointers. This will cause some level of cache bank/way conflicts.
If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.
This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than, say, 2-ways.
EDIT: It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page boundary.
Here is the test code:
int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#if ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

Benchmark results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

Observations:
  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice that both loops are faster. Furthermore, the second (double) loop is now the slower one, as you would normally expect.

As @Stephen Canon points out in the comments, there is a likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/~amplifierxe/pmw_dp/events/partial_address_alias.html


The 5 Regions - Explained

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data sizes increase, the amount of relative overhead goes down and the performance "saturates". Here, two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still apply to Core 2.)

Region 3:

At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single loop is what we are observing. And as mentioned, this is due to the alignment, which (most likely) causes false aliasing stalls in the processor's load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in the cache. So you're bound by memory bandwidth.


2 x Intel X5482 Harpertown @ 3.2 GHz Intel Core i7 870 @ 2.8 GHz Intel Core i7 2600K @ 4.4 GHz


182
I think this is THE answer. Contrary to what all the other answers say, it's not that the single-loop variant inherently has more cache misses; it's that the particular alignment of the arrays causes the cache misses. - Oliver Charlesworth
35
A false-aliasing stall seems to be the most likely explanation here. - Stephen Canon

232

OK, the right answer definitely has to do with the CPU cache. But using the cache argument can be quite difficult, especially without data.

There are many answers that led to a lot of discussion, but let's face it: cache issues can be very complex and are not one-dimensional. They depend heavily on the size of the data, so my question was unfair: it turned out to be at a very interesting point on the cache graph.

@Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.

That's why I combined his test (using a continuous vs. separate allocation) and @James' answer's advice.

The graphs below show that most of the answers, and especially the majority of comments to the question and answers, can be considered completely wrong or true depending on the exact scenario and parameters used.

Note that my initial question was at n = 100,000. This point (by accident) exhibits special behavior:

  1. It possesses the greatest difference between the one- and two-loop versions (almost a factor of three)

  2. It is the only point where the one-loop version (namely with continuous allocation) beats the two-loop version. (This made Mysticial's answer possible at all.)

Results with initialized data:

Results with uninitialized data (this is what Mysticial tested):

And this one is hard to explain: initialized data that is allocated once and reused for every following test case with a different vector size:

Proposal

Every low-level performance related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache-relevant data sizes! It's a waste of everybody's time to think of answers and especially discuss them with others without this information.


18
+1 Nice analysis. I didn't intend to leave the data uninitialized in the first place; it just happened that the allocator zeroed it anyway. So the initialized data is what matters. I just edited my answer with results on an actual Core 2 architecture machine, and they are a lot closer to what you are observing. Another thing is that I tested a range of sizes n, and it shows the same performance gap for n = 80000, n = 100000, n = 200000, etc... - Mysticial

84
The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.

48

Imagine you are working on a machine where n was just the right value for it only to be possible to hold two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.

Assuming a simple LIFO caching policy, this code:

for(int j=0;j<n;j++){
    a[j] += b[j];
}
for(int j=0;j<n;j++){
    c[j] += d[j];
}

would first cause a and b to be loaded into RAM and then be operated on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.

The other loop

for(int j=0;j<n;j++){
    a[j] += b[j];
    c[j] += d[j];
}

will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.

You are probably not seeing disk caching in your tests, but you are probably seeing the side effects of some other form of caching.


There seems to be some confusion/misunderstanding here, so I will try to elaborate a little using an example.

Say n = 2 and we are working with bytes. In my scenario we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).

Assuming a fairly dumb caching policy of "if the byte is not in the cache, put it there and get the following byte too while we are at it", you will get a scenario something like this:

  • With

for(int j=0;j<n;j++){
 a[j] += b[j];
}
for(int j=0;j<n;j++){
 c[j] += d[j];
}
  • Cache a[0] and a[1], then b[0] and b[1], and set a[0] = a[0] + b[0] in cache - there are now four bytes in the cache: a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • Set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
  • Repeat for c and d.
  • Total cost = (100 + 100 + 1 + 1) * 2 = 404

  • With

    for(int j=0;j<n;j++){
     a[j] += b[j];
     c[j] += d[j];
    }
    
  • Cache a[0] and a[1], then b[0] and b[1], and set a[0] = a[0] + b[0] in cache - there are now four bytes in the cache: a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • Eject a[0], a[1], b[0], b[1] from the cache, then cache c[0] and c[1], then d[0] and d[1], and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
  • I suspect you are beginning to see where I am going.
  • Total cost = (100 + 100 + 100 + 100) * 2 = 800
  • This is a classic cache thrash scenario.


    15
    This is incorrect. A reference to a particular element of an array does not cause the entire array to be paged in from disk (or from non-cached memory); only the relevant page or cache line is paged in. - Brooks Moses
    Four read streams (two of them also being written) is pretty much fine on modern CPUs, not significantly worse than two read streams (one of them also being written). HW L2 prefetching in modern Intel CPUs, for example, can track one forward stream per page. - Peter Cordes

    35

    It's not because of different code, but because of caching: RAM is slower than the CPU registers, and a cache memory is inside the CPU to avoid writing to RAM every time a variable is changing. But the cache is not as big as the RAM is, hence it maps only a fraction of it.

    The first code modifies distant memory addresses, alternating them at each loop, thus requiring the cache to be invalidated continuously.

    The second code doesn't alternate: it just flows over adjacent addresses twice. This makes all the work complete in the cache, invalidating it only after the second loop starts.


    21

    I cannot replicate the results discussed here.

    I don't know if poor benchmark code is to blame, or what, but the two methods are within 10% of each other on my machine using the following code, and one loop is usually just slightly faster than two - as you'd expect.

    Array sizes ranged from 2^16 to 2^24, using eight loops. I was careful to initialize the source arrays so the += assignment wasn't asking the FPU to add memory garbage interpreted as a double.

    I played around with various schemes, such as putting the assignment of b[j] and d[j] to InitToZero[j] inside the loops, and also with using += b[j] = 1 and += d[j] = 1, and I got fairly consistent results.

    As you might expect, initializing b and d inside the loop using InitToZero[j] gave the combined approach an advantage, as they were done back-to-back before the assignments to a and c, but still within 10%. Go figure.

    Hardware is a Dell XPS 8500 with generation 3 Core i7 @ 3.4 GHz and 8 GB memory. For 2^16 to 2^24, using eight loops, the cumulative times were 44.987 and 40.965 respectively. Visual C++ 2010, fully optimized.

    PS: I changed the loops to count down to zero, and the combined method was marginally faster. Scratching my head. Note the new array sizing and loop counts.

    // MemBufferMystery.cpp : Defines the entry point for the console application.
    //
    #include "stdafx.h"
    #include <iostream>
    #include <cmath>
    #include <string>
    #include <time.h>
    
    #define  dbl    double
    #define  MAX_ARRAY_SZ    262145    //16777216    // AKA (2^24)
    #define  STEP_SZ           1024    //   65536    // AKA (2^16)
    
    int _tmain(int argc, _TCHAR* argv[]) {
        long i, j, ArraySz = 0,  LoopKnt = 1024;
        time_t start, Cumulative_Combined = 0, Cumulative_Separate = 0;
        dbl *a = NULL, *b = NULL, *c = NULL, *d = NULL, *InitToOnes = NULL;
    
        a = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
        b = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
        c = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
        d = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
        InitToOnes = (dbl *)calloc( MAX_ARRAY_SZ, sizeof(dbl));
        // Initialize array to 1.0 second.
        for(j = 0; j< MAX_ARRAY_SZ; j++) {
            InitToOnes[j] = 1.0;
        }
    
        // Increase size of arrays and time
        for(ArraySz = STEP_SZ; ArraySz<MAX_ARRAY_SZ; ArraySz += STEP_SZ) {
            a = (dbl *)realloc(a, ArraySz * sizeof(dbl));
            b = (dbl *)realloc(b, ArraySz * sizeof(dbl));
            c = (dbl *)realloc(c, ArraySz * sizeof(dbl));
            d = (dbl *)realloc(d, ArraySz * sizeof(dbl));
            // Outside the timing loop, initialize
            // b and d arrays to 1.0 sec for consistent += performance.
            memcpy((void *)b, (void *)InitToOnes, ArraySz * sizeof(dbl));
            memcpy((void *)d, (void *)InitToOnes, ArraySz * sizeof(dbl));
    
            start = clock();
            for(i = LoopKnt; i; i--) {
                for(j = ArraySz; j; j--) {
                    a[j] += b[j];
                    c[j] += d[j];
                }
            }
            Cumulative_Combined += (clock()-start);
        printf("\n %6i milliseconds for combined array sizes %i and %i loops",
                    (int)(clock()-start), ArraySz, LoopKnt);
            start = clock();
            for(i = LoopKnt; i; i--) {
                for(j = ArraySz; j; j--) {
                    a[j] += b[j];
                }
                for(j = ArraySz; j; j--) {
                    c[j] += d[j];
                }
            }
            Cumulative_Separate += (clock()-start);
        printf("\n %6i milliseconds for separate array sizes %i and %i loops \n",
                    (int)(clock()-start), ArraySz, LoopKnt);
        }
        printf("\n Cumulative combined array processing took %10.3f seconds",
                (dbl)(Cumulative_Combined/(dbl)CLOCKS_PER_SEC));
    printf("\n Cumulative separate array processing took %10.3f seconds",
            (dbl)(Cumulative_Separate/(dbl)CLOCKS_PER_SEC));
        getchar();
    
        free(a); free(b); free(c); free(d); free(InitToOnes);
        return 0;
    }
    

    I'm not sure why MFLOPS was decided to be a relevant metric. I thought the idea was to focus on memory accesses, so I tried to minimize the amount of floating-point computation time. I left in the +=, but I am not sure why.

    A straight assignment with no computation would be a cleaner test of memory access time and would create a test that is uniform irrespective of the loop count. Maybe I missed something in the conversation, but it is worth thinking twice about. If the plus is left out of the assignment, the cumulative times are almost identical at 31 seconds each.


    18
    It's because the CPU doesn't have so many cache misses (where it has to wait for the array data to come from the RAM chips). It would be interesting for you to adjust the size of the arrays continually so that you exceed the sizes of the level 1 cache (L1), and then the level 2 cache (L2), of your CPU, and plot the time taken for your code to execute against the sizes of the arrays. The graph shouldn't be a straight line like you'd expect.

    13

    The first loop alternates writing in each variable. The second and third ones only make small jumps of element size.

    Try writing two parallel lines of 20 crosses with pen and paper, separated by 20 cm. Try once finishing one line and then the other, and try another time writing a cross in each line alternately.


    6

    The Original Question

    Why is one loop so much slower than two loops?


    Conclusion:

    Case 1 is a classic interpolation problem that happens to be an inefficient one. I also think that this was one of the leading reasons why many machine architectures and developers ended up building and designing multi-core systems with the ability to do multi-threaded applications as well as parallel programming.

    Looking at it from this kind of an approach, without involving how the hardware, OS, and compiler work together to do heap allocations that involve working with RAM, cache, page files, etc.; the mathematics that is at the foundation of these algorithms shows us which of these two is the better solution.

    We can use an analogy of a Boss being a Summation that will represent a For Loop that has to travel between workers A and B.

    We can easily see that Case 2 is at least half as fast, if not a little more, than Case 1 due to the difference in the distance that is needed to travel and the time taken between the workers. This math lines up almost perfectly with both the benchmark times as well as the number of differences in assembly instructions.

    I will now begin to explain how all of this works below.


    Assessing the Problem

    The OP's code:

    const int n=100000;
    
    for(int j=0;j<n;j++){
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
    

    And

    for(int j=0;j<n;j++){
        a1[j] += b1[j];
    }
    for(int j=0;j<n;j++){
        c1[j] += d1[j];
    }
    

    The Consideration

    Considering the OP's original question about the two variants of the for loops and his amended question regarding cache behavior, along with many of the other excellent answers and useful comments; I'd like to try to approach this situation and problem in a different way.


    The Approach

    Considering the two loops and all of the discussion about cache and page filing, I'd like to take another approach and look at this from a different perspective. One that doesn't involve the cache and page files, nor the executions to allocate memory; in fact, this approach doesn't even concern the actual hardware or the software at all.


    The Perspective

    After looking at the code for a while, it became quite apparent what the problem is and what is generating it. Let's break this down into an algorithmic problem and look at it from the perspective of using mathematical notations, then apply an analogy to the math problems as well as to the algorithms.


    What We Do Know

    We know that this loop will run 100,000 times. We also know that a1, b1, c1 and d1 are pointers on a 64-bit architecture. Within C++ on a 32-bit machine, all pointers are 4 bytes, and on a 64-bit machine they are 8 bytes in size, since pointers are of a fixed length.

    We know that we have 32 bytes in which to allocate in both cases. The only difference is that in the first case we are allocating 32 bytes, or two sets of 2-8 bytes, on each iteration, whereas in the second case we are allocating 16 bytes for each iteration of the two independent loops.

    Both loops still equal 32 bytes in total allocations. With this information, let's now go ahead and show the general math, the algorithms, and the analogy of these concepts.

    We do know the number of times that the same set or group of operations will have to be performed in both cases. We do know the amount of memory that needs to be allocated in both cases. We can assess that the overall workload of the allocations between both cases will be approximately the same.


    What We Don't Know

    We do not know how long it will take for each case unless we set a counter and run a benchmark test. However, the benchmarks were already included from the original question and from some of the answers and comments as well; and we can see a significant difference between the two, and this is the whole reasoning for this proposal to this problem.


    Let's Investigate

    It is already apparent that many have already done this by looking at the heap allocations, benchmark tests, looking at RAM, cache, and page files. Looking at specific data points and specific iteration indices were also included and the various conversations about this specific problem have many people starting to question other related things about it. How do we begin to look at this problem by using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.


    Our Assertions:

    • We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000 instead of starting with 0 as in the loops for we don't need to worry about the 0 indexing scheme of memory addressing since we are just interested in the algorithm itself.
    • In both cases we have four functions to work with and two function calls with two operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).

    The Algorithms:

    1st Case: - Only one summation but two independent function calls.

    Sum n=1 : [1,100000] = F1(), F2();
                           F1() = { f(a) = f(a) + f(b); }
                           F2() = { f(c) = f(c) + f(d); }
    

    2nd Case: - Two summations but each has its own function call.

    Sum1 n=1 : [1,100000] = F1();
                            F1() = { f(a) = f(a) + f(b); }
    
    Sum2 n=1 : [1,100000] = F1();
                            F1() = { f(c) = f(c) + f(d); }
    

    If you noticed F2() only exists in Sum from Case1 where F1() is contained in Sum from Case1 and in both Sum1 and Sum2 from Case2. This will be evident later on when we begin to conclude that there is an optimization that is happening within the second algorithm.

    The iterations through the first case Sum calls f(a) that will add to its self f(b) then it calls f(c) that will do the same but add f(d) to itself for each 100000 iterations. In the second case, we have Sum1 and Sum2 that both act the same as if they were the same function being called twice in a row.

    In this case we can treat Sum1 and Sum2 as just plain old Sum where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization where we can just consider it to be the same function.


    Summary with Analogy

    With what we have seen in the second case it almost appears as if there is optimization since both for loops have the same exact signature, but this isn't the real issue. The issue isn't the work that is being done by f(a), f(b), f(c), and f(d). In both cases and the comparison between the two, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.

    Think of the for loops as being the summations that do the iterations as being a Boss that is giving orders to two people A & B and that their jobs are to meet C & D respectively and to pick up some package from them and return it. In this analogy, the for loops or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss is not from the actual mathematical algorithms directly but from the actual concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has one scope where the second algorithm has two consecutive scopes.

    Within the first case on each call slip, the Boss goes to A and gives the order and A goes off to fetch B's package then the Boss goes to C and gives the orders to do the same and receive the package from D on each iteration.

    Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.

    Since we are working with an 8-byte pointer and dealing with heap allocation let's consider the following problem. Let's say that the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of executions. In both cases, the Boss initially travels from A first then to B. This analogy isn't to say that this distance is exact; it is just a useful test case scenario to show the workings of the algorithms.

    In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.


    The Test Cases:

    First Case: On first iteration the Boss has to initially go 100 feet to give the order slip to A and A goes off and does his thing, but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration and every other iteration after the Boss has to go back and forth 500 feet between the two.

    Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C because C is 500 feet from A. Since this Boss( Summation, For Loop ) is being called right after working with A he then just waits there as he did with A until all of C's order slips are done.


    The Difference In Distances Traveled

    const n = 100000
    distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
    // Simplify
    distTraveledOfFirst = 600 + (99999*1000);
    distTraveledOfFirst = 600 + 99999000;
    distTraveledOfFirst = 99999600
    // Distance Traveled On First Algorithm = 99,999,600ft
    
    distTraveledOfSecond = 100 + 500 = 600;
    // Distance Traveled On Second Algorithm = 600ft;
    

    The Comparison of Arbitrary Values

    We can easily see that 600 is far less than approximately 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM or from which cache or page file each call on each iteration is going to be due to many other unseen variables. This is just an assessment of the situation to be aware of and looking at it from the worst-case scenario.

    From these numbers it would almost appear as if algorithm one should be 99% slower than algorithm two; however, this is only the Boss's part or responsibility of the algorithms and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the Loop. So the boss's job only accounts for about 15 - 40% of the total work being done. The bulk of the work that is done through the workers has a slightly bigger impact towards keeping the ratio of the speed rate differences to about 50-70%


    The Observation: - The differences between the two algorithms

    In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient from both the partial optimization of having a similar function declaration and definition where it is only the variables that differ by name and the distance traveled.

    We also see that the total distance traveled in Case 1 is much farther than it is in Case 2, and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerably more work to do than Case 2 does.

    This is observable from the evidence of the assembly instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the boss will have to wait for both A & C to get back before he can go back to A again for each iteration. It also doesn't account for the fact that if A or B is taking an extremely long time then both the Boss and the other worker(s) are idle waiting to be executed.

    In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.



    The OP's Amended Question(s)

    EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

    Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

    It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.


    Regarding These Questions

    As I have demonstrated without a doubt, there is an underlying issue even before the Hardware and Software becomes involved.

    Now as for the management of memory and caching along with page files, etc. which all work together in an integrated set of systems between the following:

    • The architecture (hardware, firmware, some embedded drivers, kernels and assembly instruction sets).
    • The OS (file and memory management systems, drivers and the registry).
    • The compiler (translation units and optimizations of the source code).
    • And even the source code itself with its set(s) of distinctive algorithms.

    We can already see that there is a bottleneck that is happening within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS, and programmable language compared to the second algorithm. There already existed a problem before involving the intrinsics of a modern computer.


    The Ending Results

    However; it is not to say that these new questions are not of importance because they themselves are and they do play a role after all. They do impact the procedures and the overall performance and that is evident with the various graphs and assessments from many who have given their answer(s) and or comment(s).

    If you paid attention to the analogy of the Boss and the two workers A & B who had to go and retrieve packages from C & D respectively and considering the mathematical notations of the two algorithms in question; you can see without the involvement of the computer hardware and software Case 2 is approximately 60% faster than Case 1.

    When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.

    If the Data set is fairly small it may not seem all that bad of a difference at first. However, since Case 1 is about 60 - 70% slower than Case 2 we can look at the growth of this function in terms of the differences in time executions:

    DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
    //where
    Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
    // So when we substitute this back into the difference equation we end up with
    DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
    // And finally we can simplify this to
    DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)
    

    This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.

    When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than algorithm 2 which is evident when the Boss has to travel back and forth the maximum distance between A & C for every iteration after the first iteration while algorithm 2 the Boss has to travel to A once and then after being done with A he has to travel a maximum distance only one time when going from A to C.

    Trying to have the Boss focus on doing two similar things at once and juggling them back and forth instead of focusing on similar consecutive tasks is going to make him quite angry by the end of the day since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your boss get into an interpolated bottleneck, because the boss's spouse and children wouldn't appreciate it.



    Amendment: Software Engineering Design Principles

    -- The difference between local Stack and heap allocated computations within iterative for loops and the difference between their usages, their efficiencies, and effectiveness --

    The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.

    • Consecutive Stack Operations:
      • If the loops are performing operations on data locally within a single code block or scope that is within the stack frame it will still sort of apply, but the memory locations are much closer where they are typically sequential and the difference in distance traveled or execution time is almost negligible. Since there are no allocations being done within the heap, the memory isn't scattered, and the memory isn't being fetched through ram. The memory is typically sequential and relative to the stack frame and stack pointer.
    • When consecutive operations are being done on the stack, a modern processor will cache repetitive values and addresses keeping these values within local cache registers. The time of operations or instructions here is on the order of nano-seconds.
    • Consecutive Heap Allocated Operations:
      • When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, depending on the architecture of the CPU, the bus controller, and the RAM modules the time of operations or execution can be on the order of micro to milliseconds. In comparison to cached stack operations, these are quite slow.
      • The CPU will have to fetch the memory address from RAM and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.

    So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.

    It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.

    This is where software engineering and software architecture design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.

    You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant just because of the above issue that is seen from its O(n) complexity of the algorithm when working with the heap.

    From what I've noticed over the years, many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.

    If you want true optimization, yes it might seem like code duplication, but to generalize it would be more efficient to have two variants of the same algorithm. One for stack operations, and the other for heap operations that are performed in iterative loops!

    Here's a pseudo example: Two simple structs, one algorithm.

    struct A {
        int data;
        A() : data{0}{}
        A(int a) : data{a}{}
    };
    struct B {
        int data;
        B() : data{0}{}
        B(int b) : data{b}{}
    };
    
    template<typename T>
    void Foo( T& t ) {
        // Do something with t
    }
    
    // Some looping operation: first stack then heap.
    
    // Stack data:
    A dataSetA[10] = {};
    B dataSetB[10] = {};
    
    // For stack operations this is okay and efficient
    for (int i = 0; i < 10; i++ ) {
       Foo(dataSetA[i]);
       Foo(dataSetB[i]);
    }
    
    // If the above two were on the heap then performing
    // the same algorithm to both within the same loop
    // will create that bottleneck
    A* dataSetA = new A[10];
    B* dataSetB = new B[10];
    for ( int i = 0; i < 10; i++ ) {
        Foo(dataSetA[i]); // dataSetA is on the heap here
        Foo(dataSetB[i]); // dataSetB is on the heap here
    } // this will be inefficient.
    
    // To improve the efficiency above, put them into separate loops...
    
    for (int i = 0; i < 10; i++ ) {
        Foo(dataSetA[i]);
    }
    for (int i = 0; i < 10; i++ ) {
        Foo(dataSetB[i]);
    }
    // This will be much more efficient than above.
    // The code isn't perfect syntax, it's only pseudo code
    // to illustrate a point.
    

    This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much, it's the looping structures that you will use them in that do.


    2
    Four read streams (two of them also being write streams) is basically fine on modern CPUs, not significantly worse than two read streams (one of them also being written). HW L2 prefetch on modern Intel CPUs, for example, can track one forward stream per page. RAM is random access; the "distance traveled" between elements is not the key factor. It's only a problem if the cache line containing a[i+0..7] or whatever gets evicted between the reads/writes of those elements. (Or if the compiler can't see that there's no aliasing, in which case SIMD vectorization is defeated.) - Peter Cordes
    1
    The stack and the heap are just different parts of the same virtual address space, backed by the same cache hierarchy ending in DRAM. What Every Programmer Should Know About Memory? Touching newly allocated pages on the heap is slow (page faults, see Idiomatic way of performance evaluation?), but that's actually also true of the stack. It's just that the stack doesn't unmap memory when a function returns, so repeatedly calling a function that does int arr[10000] only encounters page faults on the first call. - Peter Cordes
    1
    There is no "naturally existing" bottleneck. Hardware/software always matter. You could just as easily argue that you'd naively expect reduced loop overhead to make loop fusion faster than loop fission. You seem to be basing your argument on a wrong model of how hardware works. As the accepted answer shows, the actual cause is having 4 read/write streams at the same alignment relative to a page boundary, and thus cache aliasing and possible Intel 4k aliasing effects such as false dependencies of loads on earlier stores. - Peter Cordes
    1
    You're inventing a specific cost model that isn't random access, and arguing based on that. It's not more fundamental, it's another specific model of computation, and it's not a good model for RAM (random-access memory) or for cache hierarchies with small set-associative caches and DRAM "pages". Your model would predict that a[i] += 1 would be vastly faster than a[i] += b[i] because there would be no seeking at all. But that's not how it performs when compiled and run on real CPUs. It's only the 4k conflicts between the two separate write streams that create this performance pothole. - Peter Cordes
    1
    At this point you're basically just wrong. It takes the same amount of work, just with a different access pattern. Some models of computation reward locality of multiple streams, while others (like a simple CPU without caches) don't care about locality at all, other than for keeping things in registers. The RAM abstract model of computation https://en.wikipedia.org/wiki/Random-access_machine has no benefit for locality; it assumes any element of memory can be accessed in any order at any time with equal cost. It is of course a model without caches, but your argument fails under it. - Peter Cordes

    2

    It may be old C++ and optimizations. On my computer I obtained almost the same speed:

    One loop: 1.577 ms

    Two loops: 1.507 ms

    I run Visual Studio 2015 on an E5-1620 3.5 GHz processor with 16 GB RAM.

