如何解释Cachegrind输出中的缓存未命中？

Question

如何解释Cachegrind输出中的缓存未命中？

c++cachingoptimizationvalgrindcachegrind

17

出于好奇，我编写了几个不同版本的矩阵乘法，并使用cachegrind对其进行了测试。在下面的结果中，我想知道哪些部分是L1、L2、L3缓存未命中和引用，并且这一切真正意味着什么？此外，以下是我矩阵乘法的代码，以防有人需要。

#define SLOWEST
==6933== Cachegrind, a cache and branch-prediction profiler
==6933== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==6933== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==6933== Command: ./a.out 500
==6933== 
--6933-- warning: L3 cache found, using its data for the LL simulation.
--6933-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 60.7487 seconds.
==6933== 
==6933== I   refs:      6,039,791,314
==6933== I1  misses:            1,611
==6933== LLi misses:            1,519
==6933== I1  miss rate:          0.00%
==6933== LLi miss rate:          0.00%
==6933== 
==6933== D   refs:      2,892,704,678  (2,763,005,485 rd   + 129,699,193 wr)
==6933== D1  misses:      136,223,560  (  136,174,705 rd   +      48,855 wr)
==6933== LLd misses:           53,675  (        5,247 rd   +      48,428 wr)
==6933== D1  miss rate:           4.7% (          4.9%     +         0.0%  )
==6933== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==6933== 
==6933== LL refs:         136,225,171  (  136,176,316 rd   +      48,855 wr)
==6933== LL misses:            55,194  (        6,766 rd   +      48,428 wr)
==6933== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOWER
==8463== Cachegrind, a cache and branch-prediction profiler
==8463== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8463== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8463== Command: ./a.out 500
==8463== 
--8463-- warning: L3 cache found, using its data for the LL simulation.
--8463-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 49.7397 seconds.
==8463== 
==8463== I   refs:      4,537,213,120
==8463== I1  misses:            1,571
==8463== LLi misses:            1,487
==8463== I1  miss rate:          0.00%
==8463== LLi miss rate:          0.00%
==8463== 
==8463== D   refs:      2,891,485,608  (2,761,862,312 rd   + 129,623,296 wr)
==8463== D1  misses:       59,961,522  (   59,913,256 rd   +      48,266 wr)
==8463== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==8463== D1  miss rate:           2.0% (          2.1%     +         0.0%  )
==8463== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8463== 
==8463== LL refs:          59,963,093  (   59,914,827 rd   +      48,266 wr)
==8463== LL misses:            54,600  (        6,733 rd   +      47,867 wr)
==8463== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define SLOW
==9174== Cachegrind, a cache and branch-prediction profiler
==9174== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==9174== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9174== Command: ./a.out 500
==9174== 
--9174-- warning: L3 cache found, using its data for the LL simulation.
--9174-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 35.8901 seconds.
==9174== 
==9174== I   refs:      3,039,713,059
==9174== I1  misses:            1,570
==9174== LLi misses:            1,486
==9174== I1  miss rate:          0.00%
==9174== LLi miss rate:          0.00%
==9174== 
==9174== D   refs:      1,893,235,586  (1,763,112,301 rd   + 130,123,285 wr)
==9174== D1  misses:       63,285,950  (   62,987,684 rd   +     298,266 wr)
==9174== LLd misses:           53,113  (        5,246 rd   +      47,867 wr)
==9174== D1  miss rate:           3.3% (          3.5%     +         0.2%  )
==9174== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==9174== 
==9174== LL refs:          63,287,520  (   62,989,254 rd   +     298,266 wr)
==9174== LL misses:            54,599  (        6,732 rd   +      47,867 wr)
==9174== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUM
==7838== Cachegrind, a cache and branch-prediction profiler
==7838== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==7838== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==7838== Command: ./a.out 500
==7838== 
--7838-- warning: L3 cache found, using its data for the LL simulation.
--7838-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 23.4097 seconds.
==7838== 
==7838== I   refs:      2,548,967,151
==7838== I1  misses:            1,610
==7838== LLi misses:            1,522
==7838== I1  miss rate:          0.00%
==7838== LLi miss rate:          0.00%
==7838== 
==7838== D   refs:      1,399,237,303  (1,267,363,440 rd   + 131,873,863 wr)
==7838== D1  misses:          592,807  (      293,091 rd   +     299,716 wr)
==7838== LLd misses:           53,147  (        5,248 rd   +      47,899 wr)
==7838== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==7838== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==7838== 
==7838== LL refs:             594,417  (      294,701 rd   +     299,716 wr)
==7838== LL misses:            54,669  (        6,770 rd   +      47,899 wr)
==7838== LL miss rate:            0.0% (          0.0%     +         0.0%  )

#define MEDIUMISH
==8438== Cachegrind, a cache and branch-prediction profiler
==8438== Copyright (C) 2002-2012, and GNU GPL'd, by Nicholas Nethercote et al.
==8438== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8438== Command: ./a.out 500
==8438== 
--8438-- warning: L3 cache found, using its data for the LL simulation.
--8438-- warning: pretending that LL cache has associativity 24 instead of actual 16
Multiplied matrix A and B in 24.0327 seconds.
==8438== 
==8438== I   refs:      2,550,211,553
==8438== I1  misses:            1,576
==8438== LLi misses:            1,488
==8438== I1  miss rate:          0.00%
==8438== LLi miss rate:          0.00%
==8438== 
==8438== D   refs:      1,400,107,343  (1,267,610,303 rd   + 132,497,040 wr)
==8438== D1  misses:          339,977  (       42,583 rd   +     297,394 wr)
==8438== LLd misses:           53,114  (        5,248 rd   +      47,866 wr)
==8438== D1  miss rate:           0.0% (          0.0%     +         0.2%  )
==8438== LLd miss rate:           0.0% (          0.0%     +         0.0%  )
==8438== 
==8438== LL refs:             341,553  (       44,159 rd   +     297,394 wr)
==8438== LL misses:            54,602  (        6,736 rd   +      47,866 wr)
==8438== LL miss rate:            0.0% (          0.0%     +         0.0%  )

矩阵乘法代码。

#if defined(SLOWEST)
    void multiply (float **A, float **B, float **out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row][col] += A[row][in] * B[in][col];
    }
// Takes in 1-D arrays, same as before.
#elif defined(SLOWER)
    void multiply (float *A, float *B, float *out, int size) {
        for (int row=0;row<size;row++)
            for (int col=0;col<size;col++)
                for (int in=0;in<size;in++)
                    out[row * size + col] += A[row * size + in] * B[in * size + col];
    }
// Flips first and second loops
#elif defined(SLOW)
    void multiply (float *A, float *B, float *out, int size) {
        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++) {
                float curr = 0;  // prevents from calculating position each time through
                for (int in=0;in<size;in++)
                    curr += A[row * size + in] * B[in *size + col];
                out[row * size + col] = curr;
            }
    }
#elif defined(MEDIUM)
    // Keeps it organized for future codes.
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        float *temp = new float[size];

        for (int col=0;col<size;col++) {
            for (int i=0;i<size;i++)  // stores column into sequential array
                temp[i] = B[i * size + col];
            for (int row=0;row<size;row++)
                out[row * size + col] = dotProduct(&A[row], temp, size);  // uses function above for dot product.
        }

        delete[] temp;
    }
#elif defined(MEDIUMISH)
    float dotProduct(float *A, float *B, int size) {
        float curr = 0;

        for (int i=0;i<size;i++)
            curr += A[i] * B[i];

        return curr;
    }
    void multiply (float *A, float *B, float *out, int size) {
        for (int i=0;i<size-1;i++)
            for (int j=i+1;j<size;j++)
                std::swap(B[i * size + j], B[j * size + i]);

        for (int col=0;col<size;col++)
            for (int row=0;row<size;row++)
                out[row * size + col] = dotProduct(&A[row], &B[row], size);  // uses function above for dot product.
    }
#elif defined(FAST)

#elif defined(FASTER)

#endif

- Kevin Melkowski

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Manuel Selva · Accepted Answer

根据文档，Cachegrind只模拟第一级和最后一级缓存：

引用如下：

Cachegrind 模拟程序与机器缓存层次结构的交互（可选地）和分支预测器。它模拟具有独立的第一级指令和数据缓存（I1 和 D1）的机器，由统一的第二级缓存（L2）支持。这与许多现代机器的配置完全匹配。

然而，一些现代机器具有三或四级缓存。对于这些机器（在 Cachegrind 能够自动检测到缓存配置的情况下），Cachegrind 模拟第一级和最后一级缓存。选择这样做的原因是最后一级缓存对运行时间的影响最大，因为它掩盖了对主内存的访问。此外，L1 缓存通常具有低关联性，因此模拟它们可以检测代码与此缓存的不良交互（例如，使用幂指数作为矩阵列长度遍历矩阵）。这意味着您无法获得 L2 信息，但在您的情况下仅获得 L1 和 L3 信息。

Cachegrind 输出的第一部分报告有关 L1 指令缓存的信息。在所有示例中，L1 指令缓存 misses 的数量都很小，缺失率始终为 0%。这意味着您的所有程序都适合于 L1 指令缓存。

输出的第二部分报告有关 L1 和 LL（最后一级缓存，在您的情况下是 L3）数据缓存的信息。使用“D1 miss rate”信息，您应该能够看到哪个版本的矩阵乘法算法“最高效使用缓存”。

Cachegrind 输出的最后一部分总结了有关指令和数据的 LL（最后一级缓存，在您的情况下是 L3）。因此，它提供了内存访问数以及由缓存服务的内存请求百分比的信息。