Why do TensorFlow and PyTorch gradients of the eigenvalue decomposition differ from the analytic solution?

Question

Tags: numpy, tensorflow, pytorch, derivative, automatic-differentiation

3

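The code below computes the eigenvalue decomposition of a real symmetric matrix and then the gradient of the first eigenvalue with respect to the matrix. This is done three times: 1) using the analytic formula, 2) using TensorFlow, and 3) using PyTorch. The three results differ. Can someone explain this behavior?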

import numpy as np
import torch
import tensorflow as tf


np.set_printoptions(precision=3)
np.random.seed(123)

# random matrix
matrix_np = np.random.randn(4, 4)
# make symmetric
matrix_np = matrix_np + matrix_np.T
matrix_torch = torch.autograd.Variable(torch.from_numpy(matrix_np), requires_grad=True)
matrix_tf = tf.constant(matrix_np, dtype=tf.float64)

#
# compute eigenvalue decompositions
#
# NumPy
eigvals_np, eigvecs_np = np.linalg.eigh(matrix_np)
# PyTorch
eigvals_torch, eigvecs_torch = torch.symeig(matrix_torch, eigenvectors=True, upper=True)
# TensorFlow
eigvals_tf, eigvecs_tf = tf.linalg.eigh(matrix_tf)

# make sure all three versions computed the same eigenvalues
if not np.allclose(eigvals_np, eigvals_torch.data.numpy()):
    print('NumPy and PyTorch have different eigenvalues')
if not np.allclose(eigvals_np, tf.keras.backend.eval(eigvals_tf)):
    print('NumPy and TensorFlow have different eigenvalues')

#
# compute derivative of first eigenvalue with respect to the matrix
#
# analytic gradient, see "On differentiating eigenvalues and eigenvectors" by Jan R. Magnus
grad_analytic = np.outer(eigvecs_np[:, 0], eigvecs_np[:, 0])
# PyTorch gradient
eigvals_torch[0].backward()
grad_torch = matrix_torch.grad.numpy()
# TensorFlow gradient
grad_tf = tf.gradients(eigvals_tf[0], matrix_tf)[0]
grad_tf = tf.keras.backend.eval(grad_tf)

#
# print all derivatives
#
print('-'*6, 'analytic gradient', '-'*6)
print(grad_analytic)
print('-'*6, 'Pytorch gradient', '-'*6)
print(grad_torch)
print('-'*6, 'TensorFlow gradient', '-'*6)
print(grad_tf)

This prints:

------ analytic gradient ------
[[ 0.312 -0.204 -0.398 -0.12 ]
 [-0.204  0.133  0.26   0.079]
 [-0.398  0.26   0.509  0.154]
 [-0.12   0.079  0.154  0.046]]
------ Pytorch gradient ------
[[ 0.312 -0.407 -0.797 -0.241]
 [ 0.     0.133  0.52   0.157]
 [ 0.     0.     0.509  0.308]
 [ 0.     0.     0.     0.046]]
------ TensorFlow gradient ------
[[ 0.312  0.     0.     0.   ]
 [-0.407  0.133  0.     0.   ]
 [-0.797  0.52   0.509  0.   ]
 [-0.241  0.157  0.308  0.046]]

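The main diagonals of the three results are identical. The off-diagonal elements of the TensorFlow and PyTorch gradients are either twice the analytic values or zero.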

Is this expected behavior? Why is it not documented? Are the gradients wrong?

Version info: TensorFlow 1.14.0, PyTorch 1.0.1

- Maryks

2

FYI: I ran your code with PyTorch 1.3, and matrix_torch.grad.numpy() is identical to the analytic gradient for me. - hdkrgr

Thanks! I found this: https://github.com/pytorch/pytorch/pull/23018. Apparently the triangular gradient was indeed a bug, and PyTorch has fixed it. So TensorFlow may have the same issue. - Maryks

1 Answer

- Harry Slatyer · Answer 1

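The gradient with respect to a matrix that is guaranteed to be symmetric is not really well-defined on the off-diagonals, because a valid implementation may depend on either an off-diagonal element or its mirror element across the diagonal (or on a weighted sum of the two).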

For example, a valid implementation of a function that sums the elements of a 2x2 symmetric matrix x is:

f(x) = x[0][0]+x[0][1]+x[1][0]+x[1][1]

But another valid implementation would be:

f(x) = x[0][0]+x[1][1]+2*x[0][1]

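If the symmetric matrix is part of a larger computation that guarantees it is always symmetric (e.g. x = [[a, b], [b, c]], where a, b and c are some scalars), then the gradient of the larger computation is not affected by how you define the gradient of the function with respect to the symmetric matrix. In the running example here, we get df/da = df/dc = 1 and df/db = 2, whichever way you define f.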
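As a minimal sketch of this point, the two implementations of f can be compared by writing out their gradients with respect to the entries of x by hand and then applying the chain rule through x = [[a, b], [b, c]]. The names f_all, f_upper and grad_wrt_params are made up for this illustration and do not come from any library.

```python
import numpy as np

def f_all(x):
    # Sums every entry of the 2x2 matrix.
    return x[0][0] + x[0][1] + x[1][0] + x[1][1]

def f_upper(x):
    # Same value on symmetric input, but reads only the upper triangle.
    return x[0][0] + x[1][1] + 2 * x[0][1]

# Hand-derived gradients with respect to the four entries of x.
grad_f_all = np.array([[1.0, 1.0], [1.0, 1.0]])
grad_f_upper = np.array([[1.0, 2.0], [0.0, 1.0]])

# Derivatives of x = [[a, b], [b, c]] with respect to a, b and c.
dx_da = np.array([[1.0, 0.0], [0.0, 0.0]])
dx_db = np.array([[0.0, 1.0], [1.0, 0.0]])
dx_dc = np.array([[0.0, 0.0], [0.0, 1.0]])

def grad_wrt_params(grad_x):
    # Chain rule: df/dp = sum_ij (df/dx_ij) * (dx_ij/dp).
    return np.array([np.sum(grad_x * d) for d in (dx_da, dx_db, dx_dc)])

x = [[3.0, 5.0], [5.0, 7.0]]
print(f_all(x), f_upper(x))           # same function value on symmetric input
print(grad_wrt_params(grad_f_all))    # gradient with respect to (a, b, c)
print(grad_wrt_params(grad_f_upper))  # same gradient with respect to (a, b, c)
```

The two raw gradients with respect to x differ on the off-diagonals, yet both give the same gradient with respect to (a, b, c).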

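That said, a symmetric gradient is a nice choice (as explained in the PyTorch PR linked in the comments), because it means that if you happen to apply a gradient descent update to a symmetric matrix, the matrix stays symmetric.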
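To illustrate, the sketch below takes the three gradients printed in the question (rounded to three decimals), averages each triangular gradient with its transpose, and compares the result with the analytic one. The helper symmetrize, the step size lr and the starting matrix x are assumptions made for this example only.

```python
import numpy as np

# Values copied from the output in the question (3-decimal precision).
grad_analytic = np.array([[ 0.312, -0.204, -0.398, -0.12 ],
                          [-0.204,  0.133,  0.26,   0.079],
                          [-0.398,  0.26,   0.509,  0.154],
                          [-0.12,   0.079,  0.154,  0.046]])
grad_torch = np.array([[0.312, -0.407, -0.797, -0.241],
                       [0.0,    0.133,  0.52,   0.157],
                       [0.0,    0.0,    0.509,  0.308],
                       [0.0,    0.0,    0.0,    0.046]])
grad_tf = grad_torch.T  # the TensorFlow output is the transpose of the PyTorch one

def symmetrize(g):
    # Split each off-diagonal contribution evenly between (i, j) and (j, i).
    return 0.5 * (g + g.T)

sym_torch = symmetrize(grad_torch)
sym_tf = symmetrize(grad_tf)
print(np.abs(sym_torch - grad_analytic).max())  # only rounding-level differences

# One gradient descent step on a symmetric matrix (illustrative values).
x = np.eye(4)
lr = 0.1
x_raw = x - lr * grad_torch  # triangular gradient: symmetry is lost
x_sym = x - lr * sym_torch   # symmetric gradient: symmetry is preserved
print(np.allclose(x_raw, x_raw.T), np.allclose(x_sym, x_sym.T))
```

Under these assumptions, the symmetrized PyTorch and TensorFlow gradients agree with the analytic gradient up to the rounding of the printed values.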

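Also note that the TensorFlow documentation states that the computation uses only the lower triangular part of the matrix, so the reported gradient is consistent with that.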
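The effect of reading only one triangle can be checked numerically without TensorFlow. The sketch below uses np.linalg.eigh with UPLO='L', which likewise reads only the lower triangle, and estimates the gradient of the smallest eigenvalue by central finite differences. It is an illustration of the idea under that assumption, not a reproduction of TensorFlow's internals; the helper first_eigval_lower and the step eps are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(123)
a = rng.randn(4, 4)
a = a + a.T  # make symmetric

def first_eigval_lower(m):
    # UPLO='L': only the lower triangle (and diagonal) of m is read.
    return np.linalg.eigh(m, UPLO='L')[0][0]

# Central finite differences, treating all 16 entries as independent inputs.
eps = 1e-6
fd_grad = np.zeros_like(a)
for i in range(4):
    for j in range(4):
        step = np.zeros_like(a)
        step[i, j] = eps
        fd_grad[i, j] = (first_eigval_lower(a + step)
                         - first_eigval_lower(a - step)) / (2 * eps)

# Analytic gradient v v^T, folded onto the lower triangle:
# diagonal entries stay, lower entries are doubled, upper entries are zero.
v = np.linalg.eigh(a)[1][:, 0]
outer = np.outer(v, v)
expected = np.tril(2 * outer, k=-1) + np.diag(np.diag(outer))

np.set_printoptions(precision=3, suppress=True)
print(fd_grad)
print(expected)
```

Entries above the diagonal never reach the computation, so their derivative is zero, while each lower entry stands in for both itself and its mirror and therefore carries twice the analytic value.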