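I have a C++ program that essentially performs matrix calculations. For these I use LAPACK/BLAS, usually linking against MKL or ACML depending on the platform. Many of the calculations operate on different, independent matrices, so I use std::thread to run them in parallel. However, I found that using more threads gave no speedup. I traced the problem to the daxpy BLAS routine: if two threads call this routine at the same time, each thread takes twice as long, even though the two threads operate on different arrays.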
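For readers unfamiliar with it, daxpy computes y := alpha*x + y on double-precision vectors. Below is a minimal reference sketch, assuming unit stride (the real BLAS routine also takes increment arguments). The name `daxpy_reference` is hypothetical and is not a library function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference version of the daxpy operation with unit stride:
// y[i] = alpha * x[i] + y[i] for every i in [0, n).
void daxpy_reference(std::size_t n, double alpha, const double* x, double* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

With alpha = 1 this reduces to the plain vector addition discussed in this question.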
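Next I tried writing a new, simple routine that performs the vector addition, to replace daxpy. With one thread this new routine is as fast as the BLAS routine, but when compiled with gcc it suffers from the same problem: doubling the number of threads running in parallel also doubles the time each thread needs, so there is no speedup. With the Intel C++ compiler, however, the problem disappears: as the number of threads increases, the time needed by a single thread stays constant.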
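However, I also need to compile on systems where no Intel compiler is available. So my questions are: why is there no speedup with gcc, and is there any way to improve the gcc performance?
I wrote a small program to demonstrate the problem: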
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {
    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };
        std::vector <std::size_t> times(num_threads, 0);

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];
            for (std::size_t i {0}; i < N; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }
            boost::posix_time::ptime now1 =
                boost::posix_time::microsec_clock::universal_time();
            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }
            boost::posix_time::ptime now2 =
                boost::posix_time::microsec_clock::universal_time();
            boost::posix_time::time_duration dur = now2 - now1;
            times[tid] += dur.total_milliseconds();
            delete[] pA;
            delete[] pB;
        };

        std::vector <std::thread> mythreads;
        // start threads
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }
        // wait for threads to finish
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads
                      << " took " << times[n]<< "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){
    for(std::size_t i{0}; i < dim; ++i)
    {*(++a) += *(++b);}
}
Output when compiled with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
Output when compiled with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
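So with icc, the time a single thread needs for the computation is constant (as I would expect; my CPU has 4 physical cores), whereas with gcc the per-thread time increases. Replacing the simplesum routine with BLAS::daxpy gives the same results for both icc and gcc (not surprising, since most of the time is spent in the library), and those results are almost identical to the gcc results above.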
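For anyone who wants to repeat the measurement without Boost, the same per-thread timing pattern can be sketched with the standard library only. This is a minimal sketch under my own assumptions, not the program that produced the timings above: `ThreadResult`, `add_in_place`, and `time_threads` are hypothetical names, every element is initialized, and `std::chrono::steady_clock` replaces the Boost clock. It may need `-pthread` when building with gcc.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Per-thread measurement: elapsed wall-clock time and one result value,
// returned so the compiler cannot discard the additions as unused work.
struct ThreadResult {
    long long ms;
    double last_value;
};

// Element-wise a[i] += b[i] over exactly dim elements.
void add_in_place(double* a, const double* b, std::size_t dim) {
    for (std::size_t i = 0; i < dim; ++i) {
        a[i] += b[i];
    }
}

// Starts num_threads threads; each one repeats the addition reps times on
// its own private arrays and records how long that took.
std::vector<ThreadResult> time_threads(std::size_t num_threads,
                                       std::size_t dim,
                                       std::size_t reps) {
    assert(num_threads > 0 && dim > 0);
    std::vector<ThreadResult> results(num_threads, ThreadResult{0, 0.0});
    std::vector<std::thread> workers;
    for (std::size_t tid = 0; tid < num_threads; ++tid) {
        workers.emplace_back([&results, tid, dim, reps] {
            std::vector<double> a(dim, 1.0);
            std::vector<double> b(dim, 2.0);
            const auto start = std::chrono::steady_clock::now();
            for (std::size_t n = 0; n < reps; ++n) {
                add_in_place(a.data(), b.data(), dim);
            }
            const auto stop = std::chrono::steady_clock::now();
            results[tid].ms =
                std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
            results[tid].last_value = a.back();
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return results;
}
```

Calling `time_threads(k, 936 * 936, 1000)` for k from 1 to 4 corresponds to the loop in the program above; each thread writes only its own slot of `results`, so no locking is needed.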
Shouldn't that be *(a++) += *(b++)? - danielschemmel
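To illustrate what this comment is pointing at, here is a small sketch of which element indices each increment form visits. The helper names (`indices_pre_increment`, `indices_post_increment`, `simplesum_post`) are hypothetical; the index helpers only simulate the pointer arithmetic with an integer, so nothing is read or written out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices visited when the pointer is advanced BEFORE the access, as in *(++a).
std::vector<std::size_t> indices_pre_increment(std::size_t dim) {
    std::vector<std::size_t> visited;
    std::size_t i = 0;
    for (std::size_t n = 0; n < dim; ++n) {
        visited.push_back(++i);
    }
    return visited;
}

// Indices visited when the pointer is advanced AFTER the access, as in *(a++).
std::vector<std::size_t> indices_post_increment(std::size_t dim) {
    std::vector<std::size_t> visited;
    std::size_t i = 0;
    for (std::size_t n = 0; n < dim; ++n) {
        visited.push_back(i++);
    }
    return visited;
}

// Hypothetical variant of simplesum using the post-increment form,
// which touches exactly the elements 0 .. dim-1.
void simplesum_post(double* a, const double* b, std::size_t dim) {
    assert(dim == 0 || (a != nullptr && b != nullptr));
    for (std::size_t i = 0; i < dim; ++i) {
        *(a++) += *(b++);
    }
}
```

For dim = 3 the pre-increment form visits indices 1, 2, 3, skipping element 0 and touching one element past the end of the array, while the post-increment form visits 0, 1, 2.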