为什么在Eigen库中，矩阵加法比矩阵-向量乘法慢？

Question

为什么在Eigen库中，矩阵加法比矩阵-向量乘法慢？

5

为什么矩阵加法比矩阵-向量乘法要慢得多？

矩阵加法只需要n^2次加法，而矩阵-向量乘法需要n*(n-1)次加法和n^2次乘法。

但是，在Eigen中，矩阵加法的时间是矩阵-向量乘法的两倍。是否存在任何选项来加速Eigen中的矩阵加法操作？

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
const int l=100;
MatrixXf m=MatrixXf::Random(l,l);
MatrixXf n=MatrixXf::Random(l,l);
VectorXf v=VectorXf::Random(l,1);

MatrixXf qq=MatrixXf::Random(l,1);
MatrixXf pp=MatrixXf::Random(l,l);

auto start = chrono::steady_clock::now();
for(int j=0;j<10000;j++)
qq=m*v;
auto end = chrono::steady_clock::now();
double time_duration=chrono::duration_cast<chrono::milliseconds>(end - start).count();
std::cout << setprecision(6) << "Elapsed time in seconds : "<< time_duration/1000<< "s" << std::endl;
auto start1 = chrono::steady_clock::now();
for(int j=0;j<10000;j++)
pp=m+n;
auto end1 = chrono::steady_clock::now();
double time_duration1=chrono::duration_cast<chrono::milliseconds>(end1 - start1).count();
std::cout << setprecision(6) << "Elapsed time in seconds : "<< time_duration1/1000<< "s" << std::endl;
}

测试1：无任何优化：

编译命令：g++-8 -test.cpp -o test

运行命令：./test

用时（秒）：0.323秒

用时（秒）：0.635秒

测试2：使用-march=native优化：

编译命令：g++-8 test.cpp -march=native -o test

运行命令：./test

用时（秒）：0.21秒

用时（秒）：0.372秒

测试3：使用-O3优化：

编译命令：g++-8 -test.cpp -O3 -o test

运行命令：./test

用时（秒）：0.009秒

用时（秒）：0.016秒

测试4：使用-march=native，-O3优化：

编译命令：g++-8 -test.cpp -march=native -O3 -o test

运行命令：./test

用时（秒）：0.008秒

用时（秒）：0.016秒

==============

我注意到有评论说编译器可能作弊，因为我没有使用前一次迭代的结果。为了解决这个问题，我改为进行一次迭代，并使用更大的数据量进行稳定的时间统计。

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
const int l=1000;
MatrixXf m=MatrixXf::Random(l,l);
MatrixXf n=MatrixXf::Random(l,l);
VectorXf v=VectorXf::Random(l,1);

MatrixXf qq=MatrixXf::Random(l,1);
MatrixXf pp=MatrixXf::Random(l,l);

auto start = chrono::steady_clock::now();
qq=m*v;
auto end = chrono::steady_clock::now();
double time_duration=chrono::duration_cast<chrono::microseconds>(end - start).count();

auto start1 = chrono::steady_clock::now();
pp=m+n;
auto end1 = chrono::steady_clock::now();
double time_duration1=chrono::duration_cast<chrono::microseconds>(end1 - start1).count();
std::cout << setprecision(6) << "Elapsed time in microseconds : "<< time_duration<< "us" << std::endl;
std::cout << setprecision(6) << "Elapsed time in microseconds : "<< time_duration1<< "us" << std::endl;
}

测试1：没有任何优化：

编译命令：g++-8 -test.cpp -o test

运行命令：./test

微秒级的经过时间：3125us

微秒级的经过时间：6849us

测试2：使用-march=native优化：

编译命令：g++-8 test.cpp -march=native -o test

运行命令：./test

微秒级的经过时间：1776us

微秒级的经过时间：3815us

测试3：使用-O3优化：

编译命令：g++-8 -test.cpp -O3 -o test

运行命令：./test

微秒级的经过时间：449us

微秒级的经过时间：760us

测试4：使用-march=native、-O3优化：

编译命令：g++-8 -test.cpp -march=native -O3 -o test

运行命令：./test

微秒级的经过时间：351us

微秒级的经过时间：871us

- complexfilter

你验证了循环没有被优化掉吗？编译器可能会认为运行10000次循环的效果与只运行一次相同... - Max Langhof

是的，我做了。请查看新的更新说明。它仍然需要更长时间。@MaxLanghof - complexfilter

你改变了编译标志，但仍未使用结果！你可以在每次加法或乘法后执行类似于 static float x = 0.0; x = x+qq[0][0]; x = x +pp[0] 的操作，以便编译器必须使用结果。(并在最后输出 cout << x) - Jeffrey

谢谢你的建议，@Jeffrey。我改为使用一次迭代来演示。 - complexfilter

不是你的确切问题，但是关于缓存如何/为什么重要的写作，请参见https://dev59.com/K2gu5IYBdhLWcg3wUljt。 - o11c

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- ggael · Accepted Answer

简短回答：您计算了操作次数，但忽略了内存访问次数。加法情况下，负载几乎是x2倍昂贵的，详细信息如下。

首先，由于现代CPU能够同时执行一次独立的加法和乘法，因此两种运算的实际操作次数相同。像x*y+z这样的两个顺序乘加甚至可以融合为单个操作，其成本与1次加法或1次乘法相同。如果您的CPU支持FMA，则使用-march=native时会发生这种情况，但我怀疑FMA在这里没有任何作用。

其次，在您的计算中，您忘记了测量内存访问次数。请记住，除非数据已经在L1缓存中，否则一个内存负载比一个加法或一个乘法要昂贵得多。

对于加法而言，很容易理解：我们有2*n^2个负载，其中大量缓存未命中，再加上n^2个存储。

对于列主矩阵的矩阵向量积，输入向量仅读取一次，因此需要n^2+n次加载输入，由于列被一次处理4列，因此我们需要n^2/4次读写输出向量，但几乎不会出现缓存未命中，因为它适合L1缓存。因此，加法的内存负载几乎是矩阵向量积的两倍昂贵，因此x2速度因子并不异常。

此外，矩阵向量代码更加积极地优化了显式循环剥离，尽管我怀疑在这个基准测试中这将不会有任何区别，因为您的矩阵根本不适合L1缓存。