使用C程序理解CPU缓存

3
我正在尝试用C程序来理解CPU缓存和缓存行,这是我学习大多数C概念的方式。我使用的程序如下,灵感来自博客: http://igoro.com/archive/gallery-of-processor-cache-effects/ 现在,在我的机器上,使用CFLAGS="-g -O0 -Wall"编译得到的输出如下:
./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

优化后的输出结果 (CFLAGS=-g -O3 -Wall)
CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

在博客中指出,第一个循环将数组中的每个值乘以3,第二个循环仅将每16个值乘以3。第二个循环仅执行了大约6%的第一个循环工作量,但在现代计算机上,这两个for循环所需时间相同:在我的计算机上分别为80ms和78ms。
在我的计算机上似乎不是这种情况。正如您所看到的,执行第一个循环所需的时间是第二个循环的两倍以上。
loop 1 is 0.46 seconds.

并且那是针对

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

针对不同的j值,为什么会出现这种情况?

为什么会发生这种情况?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
     It lives in a fixed-size buffer, set by the program executable's configuration according to the
     operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;

        /* CPU clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    return 0;
}

重复的问题?https://dev59.com/HHPYa4cB1Zd3GeqPiGKE - hit
并不完全重复。我之前发布了关于段错误的问题。我已经解决了那个问题,现在尝试对结果进行解释。 - liv2hak
2
@hit:虽然代码对两个问题都是相同的,但实际上所问的问题却非常不同... - Mats Petersson
1
你能报告一下你正在使用哪个编译器以及传递给它的标志吗? - Phil Miller
1
哇,这是一个相当古老的编译器。此外,在进行基准测试时,您应该启用优化编译。我敢问你在运行这个上面的硬件是什么? - Phil Miller
显示剩余2条评论
1个回答

3
我稍微修改了一下代码。首先是修改的总结:
  1. 将MAX_SIZE显著增大,以确保在发生变化时有明显的差异。(现在使用了整整2GB的内存,所以不要在32位操作系统上这样做)
  2. 运行循环1几次(在我的机器上,这会产生一些影响,因为我的机器第一次运行会比较慢——这可能是因为malloc实际上没有将内存映射到进程的地址空间中,因此在第一个循环中,我们会得到一些额外的映射内存的开销)。这也确保了CPU在执行其他循环时以“全速”运行,而不是处于“节能”状态。
  3. 通过乘以2来更快地改变第二个循环中的j值(<<= 1在这种情况下等同于*= 2——这是使用移位的老习惯)
  4. 使用+= 3而不是*= 3。(乘法比+=稍慢,但在这种情况下几乎没有区别)
  5. 添加一个loop3,它执行与loop2完全相同数量的操作,但仅在一个小的内存范围内进行[使用&与2^n-1值限制范围]。

我使用gcc -Wall -O3 -sdc=c99编译了这段代码,使用的是4.6.3版本,在quad-core Athlon 965、Fedora Core 16 x86-64和16GB RAM上运行。

以下是代码:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack.This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */

    for(int k = 0; k < 3; k++)
    {
        start = clock();

        /* Loop 1 */
        for (i = 0; i < MAX_SIZE; i++)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
    }

    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }


    // Third loop, performing the same operations as loop 2,
    // but only touching 16KB of memory
    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 3 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i & 0xfff] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}

然后得到以下结果:

CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.

您可以看到,前几个 loop2 的时间相同 - 一旦到达32,时间开始下降,因为处理器不需要每个高速缓存行,但在 loop3 的情况下,每个循环中的操作数量相对直接地影响总时间。
编辑:
乘法(*=3)与加法(+=3)在循环3中并没有太大的区别,除了会使循环时间增加约30%。

1
你可以通过在malloc分配内存后对数组进行memset来解决循环问题。 - SoapBox
@SoapBox:是的,我这么认为,但这并不能阻止其他问题的出现,比如CPU在前几十秒内没有以最高速度运行(由于省电模式),假设memset足够快,不会提高CPU速度到最大 - 不确定它是否是这样。 - Mats Petersson

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接