Erroneous single-thread memory bandwidth benchmark

Question


c++ assembly performance-testing benchmarking memory-bandwidth


To measure the bandwidth of main memory, I came up with the following approach.

Code (for the Intel compiler)

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937


int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
    }
    
    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

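Description of the code

This is a "naive" approach, and it is Linux-only. It should still serve as a rough indicator of model performance.
Compiled with ICC using the compiler flags -O3 -ffast-math -march=coffeelake.
The buffer size is 150 MiB, much larger than the last-level cache of the system (an i5-8400, Coffee Lake), which here has 2 x 16 GiB DDR4 DIMMs at 3200 MT/s.
A new allocation on each iteration should invalidate all cache lines from the previous iteration (to eliminate cache hits).
The minimum duration is recorded to counteract the effects of interrupts and OS scheduling: threads being taken off cores for a short while, and so on.
A warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature, which can alternatively be turned off by using the userspace governor).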
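The "keep the minimum over many runs" idea described above can be isolated into a small helper. The following is only a sketch: the name `min_duration_us` is hypothetical, and it uses `std::chrono::steady_clock` rather than `omp_get_wtime()` so that it does not depend on OpenMP. It does nothing to stop the compiler from optimising the timed kernel, so the generated assembly still needs to be checked.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of the original post): call `kernel`
// `repetitions` times and return the shortest wall-clock duration in
// microseconds. Taking the minimum discards runs that were disturbed by
// interrupts or by the thread being descheduled.
// If `repetitions` is 0, the largest representable double is returned.
template <typename Kernel>
double min_duration_us(Kernel&& kernel, std::size_t repetitions)
{
    using clock = std::chrono::steady_clock;
    auto best = std::numeric_limits<double>::max();
    for (std::size_t run = 0; run < repetitions; ++run)
    {
        const auto start = clock::now();
        kernel();
        const auto stop = clock::now();
        const std::chrono::duration<double, std::micro> elapsed = stop - start;
        if (elapsed.count() < best)
        {
            best = elapsed.count();
        }
    }
    return best;
}
```

A call such as `min_duration_us([&] { /* store loop over the buffer */ }, 500)` would then stand in for the hand-written timing code in the timed run.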

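Results

On my machine, I get 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to be 25 GB/s. (See my previous question, Intel Advisor's bandwidth information, where an earlier version of this code incurred page faults inside the timed region.)

Assembly: here is a link to the assembly generated for the code above: https://godbolt.org/z/Ma7PY49bE

I cannot understand why my measured bandwidth is so unreasonably high. Any hints that would help my understanding would be much appreciated.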

- Nitin Malapally


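@Sebastian: The buffer size (150 MiB) is far larger than the 9 MiB total L3 cache. Using NT stores is certainly an option, but for sizes larger than L3 you would expect NT stores to be faster, because you only pay for the actual writes and not for the RFOs (see Enhanced REP MOVSB for memcpy). Still, it is a good point and worth comparing. But I would not recommend wbinvd! It is very hard to use. Instead, loop over the buffer again between timed runs and use clflushopt on it, or make the buffer larger (e.g. 1 GiB) so that L3 hits become even rarer. - Peter Cordes

In a previous question, the same user mentioned getting exactly the same 90 GB/s with a smaller 50 MiB buffer, so this is probably not just a timing fluke. - Peter Cordes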


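@Sebastian: Oh, according to the Godbolt link, this was compiled with ICC and is already using vmovntpd ymm NT stores! (These behave the same as stores to uncacheable write-combining memory.) - Peter Cordes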

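Is 25 GB/s a limit of the CPU or of the RAM? What is your hardware configuration: CPU, CPU clock, motherboard, memory, banks, memory frequency, wait states? Where does the 25 GB/s reference figure come from? - Sebastian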


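Your CPU (mentioned in the other question: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0 GHz) [Coffee Lake]) supports two memory channels. In theory, the maximum memory rate would be 3200 MT/s * 8 B/T * 2 = 51,200 MB/s, provided both channels can be used, which depends on your motherboard and on which slots the memory is installed in. Intel specifies the maximum as 41.6 GB/s, probably for a slower memory speed. - Sebastian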
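The arithmetic in this comment can be checked with a few lines of code. This is only a sketch of the formula as stated (transfer rate times bytes per transfer times channel count); the function name `peak_bandwidth_mb_per_s` is hypothetical, and the result is a theoretical ceiling, not a measured value.

```cpp
#include <cstdint>

// Theoretical peak bandwidth in MB/s:
// (mega-transfers per second) * (bytes per transfer) * (number of channels).
// Hypothetical helper name, used for illustration only.
constexpr std::uint64_t peak_bandwidth_mb_per_s(std::uint64_t mega_transfers_per_s,
                                                std::uint64_t bytes_per_transfer,
                                                std::uint64_t channels)
{
    return mega_transfers_per_s * bytes_per_transfer * channels;
}

// DDR4 at 3200 MT/s, 8 bytes per transfer, two channels.
static_assert(peak_bandwidth_mb_per_s(3200, 8, 2) == 51200,
              "matches the 51,200 MB/s quoted in the comment");
```

With a single usable channel the same formula gives `peak_bandwidth_mb_per_s(3200, 8, 1)`, i.e. half of the dual-channel figure.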


1 Answer


- Nitin Malapally · Accepted Answer

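In effect, the question seems to be "why is the bandwidth obtained so high?", and on that I have received a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.

I can still offer an auxiliary "answer" on the topic of interest. By replacing the write operation (which, as I now understand, cannot be modelled correctly in a benchmark without digging into the assembly) with a cheap operation such as a bitwise operation, we can prevent the compiler from doing its job a little too well.

Updated code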

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign


int main()
{
    // test-parameters
    const auto size = std::size_t{100 * 1024 * 1024};
    const auto experiment_count = std::size_t{250};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // allocate for exp. data and load the memory pages
    char* data = nullptr;
    posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size);
    if (data == nullptr)
    {
        std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
        std::abort();
    }
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 0;
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // run
        const auto dur1 = omp_get_wtime() * 1E+6;
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] ^= 1;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
    }
    
    // deallocate resources
    free(data);
        
    // REPORT
    const auto traffic = size * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

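The benchmark is still a "naive" one and should only serve as an indicator of model performance (as opposed to a program that can compute the memory bandwidth exactly). With the updated code, I get 24 GiB/s single-threaded and 37 GiB/s when all 6 cores are engaged. Compared with the values measured by Intel Advisor, 25.5 GiB/s and 37.5 GiB/s, I consider this acceptable.

@PeterCordes I have retained the warm-up loop so as to perform one exactly identical run of the whole procedure, to counteract unknown effects (healthy programmer's paranoia).

Edit: In this case the warm-up loop is indeed redundant, because the minimum duration is being clocked.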