为什么不同大小的整数数组具有不同的性能表现？

Question

为什么不同大小的整数数组具有不同的性能表现？

c++arraysperformance

31

我有以下问题：

对于int8、int16、int32和int64类型的std::array，每增加一倍大小，写入时间就会加倍。我可以理解在8位CPU上的这种行为，但在32/64位上就不应该出现这种情况。

为什么32位系统需要4倍的时间来保存32位的值，而保存8位的值只需要1倍的时间呢？

以下是我的测试代码：

#include <iostream>
#include <array>
#include <chrono>

std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;

void PutZero()
{
    auto point1 = std::chrono::high_resolution_clock::now();
    for (auto &v : int8Array) v = 0;
    auto point2 = std::chrono::high_resolution_clock::now();
    for (auto &v : int16Array) v = 0;
    auto point3 = std::chrono::high_resolution_clock::now();
    for (auto &v : int32Array) v = 0;
    auto point4 = std::chrono::high_resolution_clock::now();
    for (auto &v : int64Array) v = 0;
    auto point5 = std::chrono::high_resolution_clock::now();
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}

int main()
{
    PutZero();
    std::cout << std::endl << "Press enter to exit" << std::endl;
    std::cin.get();
    return 0;
}

我使用以下命令在Linux下进行编译：g++ -o array_issue_1 main.cpp -O3 -std=c++14

我的结果如下：

Time of processing int8 array:  9922us.   
Time of processing int16 array: 37717us.   
Time of processing int32 array: 76064us.   
Time of processing int64 array: 146803us.

如果我使用-O2编译，那么对于int8，结果会恶化5倍！你也可以在Windows上编译此源代码。您将获得类似的结果关系。 更新＃1 当我使用-O2编译时，我的结果如下：

Time of processing int8 array:  60182us.  
Time of processing int16 array: 77807us.  
Time of processing int32 array: 114204us.  
Time of processing int64 array: 186664us.

我没有分析汇编输出。我的主要观点是我想在C++中编写高效的代码，像std::array这样的东西从性能角度来看可能具有挑战性，并且在某种程度上是违反直觉的。

- wrealcon

2

你看过汇编输出了吗？ - Ben Steffan

我编写了一个基准测试，其中测试了许多内容，包括测试BYTE、WORD、DWORD和QWORD写入的成本。结果是，在现代英特尔CPU上，它们都只需要1个周期，除非它们跨越缓存行边界，此时需要2个周期。请注意，如果您在字节粒度位置随机写入数据，则更大的值更有可能跨越缓存行边界。实际上，这个基准测试和大多数代码将使用对齐的缓冲区，因此根本不会发生跨越边界的情况。 - BeeOnRope

{btsdaf} - philipxy

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Mysticial · Accepted Answer

为什么32位系统保存32位值所需的时间比保存8位值多4倍？

实际上不是这样的。但是你的基准测试存在三个不同的问题，导致了这些结果：

1. 没有进行预取内存。因此在基准测试期间会出现页面错误（page faults）。 2. 使用O3优化的编译器完全打败了你的基准测试，将所有循环转换为memset()函数。 3. 基准测试受到内存限制。因此你测量的是内存速度而不是核心速度。

问题1：测试数据未预先缓存

您已经声明了数组，但在基准测试之前没有使用它们。由于内核和内存分配的工作方式，它们尚未映射到内存中。只有当您第一次触摸它们时才会发生这种情况。当它发生时，内核需要付出很大的代价来映射页面。

可以通过在基准测试之前触摸所有数组来完成此操作。

无预缓存: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.

使用Pre-Faulting技术: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.

时间下降了大约4倍。换句话说，您原来的基准测试测量的更多是内核而不是实际代码。

问题2：编译器击败了基准测试

编译器能够识别你写零的模式，并且完全用调用memset()来替代所有循环。因此，实际上你在测量不同大小的memset()调用。

  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 67108864
  mov edi, OFFSET FLAT:int8Array
  mov r14, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 134217728
  mov edi, OFFSET FLAT:int16Array
  mov r13, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 268435456
  mov edi, OFFSET FLAT:int32Array
  mov r12, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 536870912
  mov edi, OFFSET FLAT:int64Array
  mov rbp, rax
  call memset
  call std::chrono::_V2::system_clock::now()

实现此功能的优化选项是-ftree-loop-distribute-patterns。即使您关闭了它，向量化器也会给您带来类似的效果。

使用-O2，向量化和模式识别都被禁用。因此编译器会按照你所写的给出结果。

.L4:
  mov BYTE PTR [rax], 0         ;; <<------ 1 byte at a time
  add rax, 1
  cmp rdx, rax
  jne .L4
  call std::chrono::_V2::system_clock::now()
  mov rbp, rax
  mov eax, OFFSET FLAT:int16Array
  lea rdx, [rax+134217728]
.L5:
  xor ecx, ecx
  add rax, 2
  mov WORD PTR [rax-2], cx      ;; <<------ 2 bytes at a time
  cmp rdx, rax
  jne .L5
  call std::chrono::_V2::system_clock::now()
  mov r12, rax
  mov eax, OFFSET FLAT:int32Array
  lea rdx, [rax+268435456]
.L6:
  mov DWORD PTR [rax], 0        ;; <<------ 4 bytes at a time
  add rax, 4
  cmp rax, rdx
  jne .L6
  call std::chrono::_V2::system_clock::now()
  mov r13, rax
  mov eax, OFFSET FLAT:int64Array
  lea rdx, [rax+536870912]
.L7:
  mov QWORD PTR [rax], 0        ;; <<------ 8 bytes at a time
  add rax, 8
  cmp rdx, rax
  jne .L7
  call std::chrono::_V2::system_clock::now()

使用 -O2 编译选项: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.

现在很清楚，较小的字长速度较慢。但是如果所有字长的速度相同，您会预期时间是固定的。它们之所以不是这样，是因为受到内存带宽的影响。

问题3：内存带宽

由于基准测试（按照现有写法）仅写入0，因此很容易使核心/系统的内存带宽饱和。因此，基准测试受到触及内存量的影响。

为解决这个问题，我们需要缩小数据集大小，以适应缓存。为了弥补这一点，我们需要多次循环同一组数据。

std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;

...

auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();

现在我们看到不同字长的时间更加平坦：

http://coliru.stacked-crooked.com/a/f534f98f6d840c5c

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.

之所以它不是完全平坦的，可能是因为编译器优化涉及许多其他因素。您可能需要采取循环展开来更接近目标。