Eigen C++：稀疏矩阵操作的性能

Question

Eigen C++：稀疏矩阵操作的性能

3

有人能解释一下Eigen稀疏矩阵的以下行为吗？我已经研究了别名和惰性求值，但似乎无法改进这个问题。技术规格：我在Ubuntu 16.10上使用最新的Eigen稳定版本和g++编译器，没有优化标志。

假设我以以下方式定义一个简单的身份：

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

然后对其执行这些操作

spIdent-spIdent;
spIdent*spIdent;
spIdent - spIdent*spIdent;

并测量所有三个的计算时间。我得到的结果是这样的

0 Computation time: 2.6e-05
1 Computation time: 2e-06 
2 Computation time: 1.10706

意思是任何一种操作都很快，但组合起来就变得非常慢。 noalias() 方法仅适用于密集矩阵，在我提供的密集示例中并没有太大的区别。有什么启示吗？

MCVE:

#include <iostream>
#include <ctime>
#include "../Eigen/Sparse"

using namespace std;
using namespace Eigen;

int main() {

unsigned int N=2000000;

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

clock_t start=clock();
spIdent*spIdent;
cout << "0 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent-spIdent;
cout << "1 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent - (spIdent*spIdent);
cout << "2 Computation time: " << float(clock() - start)/1e6 << '\n';

return 0;

}

- Tim

1

使用优化标志？ - Avi Ginsburg

1

问题并不仅仅是让它更快...而是要理解为什么第三种情况如此缓慢！ - Tim

5

前两个可能被优化掉了，代码不会产生任何效果。第三个没有被优化掉。 - Yakk - Adam Nevraumont

在没有开启优化的情况下询问速度，是徒劳无功的练习。 - Martin Bonner supports Monica

2

你能否创建一个能够复现你时间测试结果的MCVE代码呢？http://stackoverflow.com/help/mcve - kangshiyin

显示剩余3条评论

3个回答

4

它并不是被优化掉了，而是因为惰性求值非常懒惰。看看这个函数。在这台机器上调用的代码是（至少在包含Eigen的那个版本中）：

template<typename Derived>
template<typename OtherDerived>
inline const typename SparseSparseProductReturnType<Derived,OtherDerived>::Type
SparseMatrixBase<Derived>::operator*(const SparseMatrixBase<OtherDerived> &other) const
{
  return typename SparseSparseProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
}

该函数返回一个表示乘积的表达式（即惰性）。对这个表达式什么也没做，所以成本为零。差值也一样。现在，在执行 a-a*a 时，a*a 是一个表达式。然后它遇到了 operator-。这会看到右侧的表达式。然后对此表达式进行评估以产生临时值（即需要时间），以便在 operator- 中使用它。为什么要评估为临时值？请阅读此处获取其逻辑（搜索“第二种情况”）。

operator-是一个带有乘积表达式作为右侧的CwiseBinaryOp。 CwiseBinaryOp 做的第一件事就是将右侧赋值给一个成员：

EIGEN_STRONG_INLINE CwiseBinaryOp(const Lhs& aLhs, const Rhs& aRhs, const BinaryOp& func = BinaryOp())
      : m_lhs(aLhs), m_rhs(aRhs), m_functor(func)

(m_rhs(aRhs)) 调用一个 SparseMatrix 构造函数:

/** Constructs a sparse matrix from the sparse expression \a other */
template<typename OtherDerived>
inline SparseMatrix(const SparseMatrixBase<OtherDerived>& other)
  : m_outerSize(0), m_innerSize(0), m_outerIndex(0), m_innerNonZeros(0)
{
  ...
  *this = other.derived();
}

这将调用operator=，而operator=将(如果我说错了，请纠正)总是触发一个计算，本例中计算生成一个临时对象。

- Avi Ginsburg

1

那么，在表达式a-a*a中，是否有可能避免使用临时变量？ - AlQuemist

使用Eigen 3.2采用的自下而上的方法，我认为不行。Eigen 3.3（beta）可能采用自上而下的方法（我记得在某个地方看到过，但我可能记错了）。如果你问的是理论上是否可能，那当然可以。 - Avi Ginsburg

3

更精确地说，在Eigen 3.2中，第三个语句中的乘积会立即计算成一个临时变量。但在Eigen 3.3中不再是这样。因此，在Eigen 3.3中，这三个语句都将被简化为no-op。然而，即使使用3.3，一旦第三个语句被赋值给某个矩阵，那么该乘积也将在一个临时变量中进行评估。通过点积惰性地计算它将非常慢。在这种情况下，使用临时变量是值得的！ - ggael

感谢您的澄清。 - Avi Ginsburg

3

我认为正如@hfhc2所提到的，代码中的前两个语句由于结果在后面没有被使用而被编译器完全优化掉了。在第三个语句中，很可能会产生一个辅助中间变量来存储spIdent*spIdent的临时结果。为了更清楚地看到这一点，请考虑以下包括显式复制赋值的示例：

#include <iostream>
#include <ctime>
#include <Eigen/Sparse>

using namespace std;
using namespace Eigen;

int main () {

   const unsigned int N = 2000000;

   SparseMatrix<double> spIdent(N,N);
   SparseMatrix<double> a(N,N), b(N,N), c(N,N);

   spIdent.reserve(N);
   spIdent.setIdentity();

   clock_t start = clock();
   a = spIdent*spIdent;
   cout << "0 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   b = spIdent-spIdent;
   cout << "1 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   c = a - b;
   cout << "2 Computation time: " << float(clock() - start)/1e6 << endl;

   return 0;

}

测量的时间（未经编译器优化）为[针对openSUSE 12.2（x86_64），g++ 4.7.1，Intel 2核2GHz CPU]：

0 Computation time: 1.58737
1 Computation time: 0.417798
2 Computation time: 0.428174

这似乎是相当合理的。

- AlQuemist

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- hfhc2 · Accepted Answer

好的，正如其他人提到的那样，在前两个语句中，代码被完全优化掉了（我已经使用当前版本的g ++和-O3进行了测试）。对于第二条语句，反汇编显示了这一点：

  400e78:   e8 03 fe ff ff          callq  400c80 <clock@plt>   # timing begins
  400e7d:   48 89 c5                mov    %rax,%rbp
  400e80:   e8 fb fd ff ff          callq  400c80 <clock@plt>   # timing ends

对于第三部分，实际上发生了一些事情，一个Eigen库的代码被调用：

  400ede:   e8 9d fd ff ff          callq  400c80 <clock@plt>   # timing begins
  400ee3:   48 89 c5                mov    %rax,%rbp
  400ee6:   8b 44 24 58             mov    0x58(%rsp),%eax
  400eea:   39 44 24 54             cmp    %eax,0x54(%rsp)
  400eee:   c6 44 24 20 00          movb   $0x0,0x20(%rsp)
  400ef3:   48 89 5c 24 28          mov    %rbx,0x28(%rsp)
  400ef8:   48 89 5c 24 30          mov    %rbx,0x30(%rsp)
  400efd:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
  400f04:   00 00 
  400f06:   c6 44 24 40 01          movb   $0x1,0x40(%rsp)
  400f0b:   0f 85 99 00 00 00       jne    400faa <main+0x22a>
  400f11:   48 8d 4c 24 1f          lea    0x1f(%rsp),%rcx
  400f16:   48 8d 54 24 20          lea    0x20(%rsp),%rdx
  400f1b:   48 8d bc 24 90 00 00    lea    0x90(%rsp),%rdi
  400f22:   00 
  400f23:   48 89 de                mov    %rbx,%rsi
  400f26:   e8 25 1a 00 00          callq  402950 <_ZN5Eigen13CwiseBinaryOpINS_8internal20scalar_difference_opIdEEKNS_12SparseMatrixIdLi0EiEEKNS_19SparseSparseProductIRS6_S8_EEEC1ES8_RSA_RKS3_>
  400f2b:   48 8d bc 24 a0 00 00    lea    0xa0(%rsp),%rdi
  400f32:   00 
  400f33:   e8 18 02 00 00          callq  401150 <_ZN5Eigen12SparseMatrixIdLi0EiED1Ev>
  400f38:   e8 43 fd ff ff          callq  400c80 <clock@plt>   # timing ends

我猜在这种情况下，编译器无法确定计算结果未被使用，与前两种情况不同。

如果您查看文档，则可以看到像+这样的简单操作在稀疏矩阵上不返回矩阵，而是表示结果的CwiseUnaryOp。我想，如果您未在某个地方使用此类，则不会构造出结果矩阵。