32位整数的排序速度与64位整数相同

Question

32位整数的排序速度与64位整数相同

c++cachingmemoryint

3

以下是我认为的事实：

快速排序应该相当友好地使用缓存。
一个缓存行有64个字节，可以包含16个32位整数或8个64位整数。

假说：

对32位整数向量进行排序应该比对64位整数向量进行排序更快。

但是当我运行下面的代码时，得到的结果是：

i16 = 7.5168 
i32 = 7.3762 
i64 = 7.5758

为什么我得不到想要的结果？

C++：

#include <iostream> 
#include <vector>
#include <cstdint>
#include <algorithm>
#include <chrono>


int main() {
  const int vlength = 100'000'000;
  const int maxI = 50'000;

  std::vector<int16_t> v16;
  for (int i = 0; i < vlength; ++i) {
    v16.push_back(int16_t(i%maxI));
  }
  std::random_shuffle(std::begin(v16), std::end(v16));
  std::vector<int32_t> v32;
  std::vector<int64_t> v64;
  for (int i = 0; i < vlength; ++i) {
    v32.push_back(int32_t(v16[i]));
    v64.push_back(int64_t(v16[i]));
  }

  auto t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v16), std::end(v16));
  auto t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i16 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;

  t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v32), std::end(v32));
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i32 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;

  t1 = std::chrono::high_resolution_clock::now();
  std::sort(std::begin(v64), std::end(v64));
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i64 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count() << std :: endl;

}

编辑：为了避免关于缓存友好排序的问题，我也尝试了以下代码：

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    ++i;
  }
}

int main() {
  const int nIter = 1000;

  std::vector<int16_t> v16(1000000);
  std::vector<int32_t> v32(1000000);
  std::vector<int64_t> v64(1000000);



  auto t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v16);
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i16 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;

  t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v32);
  }
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i32 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;

  t1 = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < nIter; ++i) {
    function_speed(v64);
  }
  t2 = std::chrono::high_resolution_clock::now();
  std :: cout << "i64 = " << (std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1)).count()/double(nIter) << std :: endl;

}

典型结果：

i16 = 0.00618648
i32 = 0.00617911
i64 = 0.00606275

我知道适当的基准测试本身就是一门科学，也许我做错了。

编辑2：通过避免溢出，我现在开始得到更有趣的结果：

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    ++i;
    i %= 1000;
  }
}

提供的结果如下：

i16 = 0.0143789
i32 = 0.00958941
i64 = 0.019691

如果我改为这样做：

template <typename T>
inline void function_speed(T& vec) {
  for (auto& i : vec) {
    i = (i+1)%1000;
  }
}

I get:

i16 = 0.00939448
i32 = 0.00913768
i64 = 0.019615

- Klein

7.5，是秒还是毫秒？ - Jean-François Fabre

你期望有什么不同（因素）？ - Jarod42

@Jean-FrançoisFabre：秒。 - Klein

@Jarod42：我不确定我期望的差异有多大，但至少应该是明显的。 - Klein

计数排序在处理 int16_t 的情况下实际上会比其他排序算法更快。 - MSalters

显示剩余2条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- MSalters · Accepted Answer

3

错误的假设; 所有O(N log N)的排序算法对于N!种可能的输入中的绝大多数来说都是不友好的。

此外，我认为优化编译器可以直接删除排序，而未经优化的构建当然是无意义的基准测试。

- MSalters

我一直认为归并排序只要有至少三路缓存，就很友好。 - Mark Ransom

1

@MarkRansom：在合并大序列时，它最终会出现。而且，在操作单个缓存行时，早期的东西也不太糟糕（尽管您可能不会将归并排序使用到最后2个元素；相反，您会使用专用的排序器来处理缓存行级别）。中间部分看起来相当糟糕，我认为您仍然会有O（N log N）的缓存未命中。预取器可能会有所帮助。 - MSalters

我并不认为缓存不友好是性能差异缺乏原因。缓存不友好保证必须对所有三种整数大小进行 O(n log n) 内存访问。但在 int64 情况下，你仍需读取 4 倍于 int16 的字节数。所以如果这是一个受内存限制算法，那么基准测试应该有 4 倍的速度差异。 - Nicu Stiurca

其实，我再想了三秒钟，意识到在int16的情况下，即使每个内存提取每次比较都会获取2个有用数据字节，但架构仍然会拉取周围的62个字节，因此每个内存提取都是64个字节，而不管int中的位数。 - Nicu Stiurca

@NicuStiurca：没错，排序算法如何使用附近的值非常关键。问题在于，平均而言，附近的输入值在输出中距离数组的一半。你只需要进行有限次12字节的移动，但这里需要在超过100MB的距离上进行多次移动。 - MSalters