C++ OpenMP：在for循环内写入矩阵会显著减慢for循环速度

Question

C++ OpenMP：在for循环内写入矩阵会显著减慢for循环速度

3

我有以下代码。 bitCount 函数只计算64位整数中的位数。 test 函数是我在更复杂的代码中做类似事情的示例，其中我试图复制它如何使写入矩阵显着降低for循环性能，并且我正在尝试找出为什么会这样，以及是否有任何解决方法。

#include <vector>
#include <cmath>
#include <omp.h>

// Count the number of bits
inline int bitCount(uint64_t n){

  int count = 0;

  while(n){

    n &= (n-1);
    count++;

  }

  return count;

}


void test(){

  int nthreads = omp_get_max_threads();
  omp_set_dynamic(0);
  omp_set_num_threads(nthreads);

  // I need a priority queue per thread
  std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
  std::vector<uint64_t> vals(100,1);

  # pragma omp parallel for shared(mat,vals)
  for(int i = 0; i < 100000000; i++){
    std::vector<double> &tid_vec = mat[omp_get_thread_num()];
    int total_count = 0;
    for(unsigned int j = 0; j < vals.size(); j++){
      total_count += bitCount(vals[j]);
      tid_vec[j] = total_count; // if I comment out this line, performance increase drastically
    }
  }

}

这段代码需要大约11秒钟才能运行。如果我注释掉以下代码：

tid_vec[j] = total_count;

代码运行需要大约2秒钟。在我的情况下，为什么写入矩阵会影响性能呢？

- Cauchy

1

根据您的编译器和选项，当您移除序列化存储时，内部循环约简可能会进行SIMD向量化。 - tim18

2

如果没有存储，那么for循环也不会执行，这也是事实。也许它被优化掉了？ - Robert Prévost

1

如果你想得到具體的答案而不是猜測，你必須提供編譯器版本、選項、硬件以及「最小完整可重現範例」（MCVE）等詳細信息。此外，請注意 bitcount 被廣泛稱為 popcnt，並已被優化至極致。 - Zulan

OT，但是... 1）许多X86处理器现在都有popcnt指令，2）如果您不想使用它，至少使用经典的popcnt，需要一个常量六个阶段来处理64b，而不是最多64个阶段。例如：https://dev59.com/IXjZa4cB1Zd3GeqPgqgu - Jim Cownie

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Jorge Bellon · Accepted Answer

既然您没有提及编译器/系统规格，我假设您正在使用GCC和标志-O2 -fopenmp进行编译。

如果您注释掉此行：

tid_vec[j] = total_count;

编译器将优化掉所有没有被使用的计算结果。因此：

  total_count += bitCount(vals[j]);

优化也很重要。如果您的应用程序主要内核没有被使用，那么运行速度会更快。

另一方面，我不会自己实现位计数函数，而是依赖于已经为您提供的功能。例如，GCC内置函数包括__builtin_popcount，它正是您试图实现的功能。

额外的提示：最好在私有数据上工作，而不是使用不同数组元素的公共数组进行操作。它可以提高局部性（特别是当访问内存不均匀时，比如NUMA），并可能减少访问争用。

# pragma omp parallel shared(mat,vals)
{
std::vector<double> local_vec(1000,-INFINITY);
#pragma omp for
for(int i = 0; i < 100000000; i++) {
  int total_count = 0;
  for(unsigned int j = 0; j < vals.size(); j++){
    total_count += bitCount(vals[j]);
    local_vec[j] = total_count;
  }
}
// Copy local vec to tid_vec[omp_get_thread_num()]
}