Why do TensorFlow and SciPy compute different Pearson correlation coefficients?

Question

I compute the Pearson correlation in two ways:

In TensorFlow, I use the following metric:

tf.contrib.metrics.streaming_pearson_correlation(y_pred, y_true)

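When I evaluate my network on the test data, I get the following results:

Loss = 0.5289223349094391

Pearson correlation = 0.3701728057861328

(The loss is mean_squared_error.)

Then I predict on the test data and compute the same metrics with SciPy: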

import numpy as np
import scipy.stats as measures

per_coef = measures.pearsonr(y_pred, y_true)[0]
mse_coef = np.mean(np.square(np.array(y_pred) - np.array(y_true)))

Here are my results:

Pearson = 0.5715300096509959

MSE = 0.5289223312665985

Is this a known issue? Is it normal?

Minimal, Complete, and Verifiable Example

import tensorflow as tf
import scipy.stats as measures

y_pred = [2, 2, 3, 4, 5, 5, 4, 2]
y_true = [1, 2, 3, 4, 5, 6, 7, 8]

## Scipy
val2 = measures.pearsonr(y_pred, y_true)[0]
print("Scipy's Pearson = {}".format(val2))

## Tensorflow
logits = tf.placeholder(tf.float32, [8])
labels = tf.to_float(tf.Variable(y_true))

acc, acc_op = tf.contrib.metrics.streaming_pearson_correlation(logits,labels)

sess = tf.Session()
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
sess.run(acc, {logits:y_pred})
sess.run(acc_op, {logits:y_pred})

print("Tensorflow's Pearson:{}".format(sess.run(acc,{logits:y_pred})))

- Astariul

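It would be easier for others to help you if you provided a minimal, complete, and verifiable example. Do you see a similar discrepancy with a simple data set, e.g. y_pred = [2, 2, 3, 4, 5, 5, 4, 2], y_true = [1, 2, 3, 4, 5, 6, 7, 8]? - Warren Weckesser

I do see a difference, but it is small. Do you think this small difference is the cause? Why is there any difference at all? - Astariul

With your MCVE, both TensorFlow and SciPy report a Pearson correlation of 0.3806076 in all of my tests. - 0xsx

Yes, for me too; the difference only shows up in the digits after those. - Astariul

What happens if you run the TensorFlow code with float64 instead of float32? - Warren Weckesser

1 Answer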

- Kilian Batzner · Accepted Answer

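In the minimal verifiable example you gave, y_pred and y_true are lists of integers. In the first lines of the scipy.stats.pearsonr source, you can see that the inputs are converted to NumPy arrays with x = np.asarray(x). We can check the resulting data type of these arrays with: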

print(np.asarray(y_pred).dtype)  # Prints 'int64'

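When two int64 numbers are divided, SciPy uses float64 precision, whereas TensorFlow uses float32 precision in the example above. Even for a single division, the difference is visible: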

>>> '%.15f' % (8.5 / 7)
'1.214285714285714'
>>> '%.15f' % (np.array(8.5, dtype=np.float32) / np.array(7, dtype=np.float32))
'1.214285731315613'
>>> '%.15f' % (np.array(8.5, dtype=np.float32) / np.array(7, dtype=np.float32) - 8.5 / 7)
'0.000000017029899'
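The same effect can be reproduced without TensorFlow. The following is only a rough NumPy sketch: `pearson` is a hypothetical helper written for this illustration, not the SciPy or TensorFlow implementation. It evaluates the textbook Pearson formula on the example data while forcing every intermediate value into a chosen dtype.

```python
import numpy as np

def pearson(x, y, dtype):
    """Textbook Pearson r with all intermediates kept in `dtype` (illustrative helper)."""
    x = np.asarray(x, dtype=dtype)
    y = np.asarray(y, dtype=dtype)
    dx = x - x.mean(dtype=dtype)  # deviations from the mean
    dy = y - y.mean(dtype=dtype)
    num = (dx * dy).sum(dtype=dtype)
    den = np.sqrt((dx * dx).sum(dtype=dtype) * (dy * dy).sum(dtype=dtype))
    return num / den

y_pred = [2, 2, 3, 4, 5, 5, 4, 2]
y_true = [1, 2, 3, 4, 5, 6, 7, 8]

r64 = pearson(y_pred, y_true, np.float64)
r32 = pearson(y_pred, y_true, np.float32)

print("float64: {:.15f}".format(float(r64)))
print("float32: {:.15f}".format(float(r32)))
print("abs diff: {:.3e}".format(abs(float(r32) - float(r64))))
```

Both values agree in the leading digits (about 0.3806076), so precision alone accounts only for a discrepancy far smaller than the one reported in the question.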

You can get identical results from SciPy and TensorFlow by using float32 precision for y_pred and y_true:

import numpy as np
import tensorflow as tf
import scipy.stats as measures

y_pred = np.array([2, 2, 3, 4, 5, 5, 4, 2], dtype=np.float32)
y_true = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.float32)

## Scipy
val2 = measures.pearsonr(y_pred, y_true)[0]
print("Scipy's Pearson: \t\t{}".format(val2))

## Tensorflow
logits = tf.placeholder(tf.float32, [8])
labels = tf.to_float(tf.Variable(y_true))

acc, acc_op = tf.contrib.metrics.streaming_pearson_correlation(logits,labels)

sess = tf.Session()
sess.run(tf.local_variables_initializer())
sess.run(tf.global_variables_initializer())
sess.run(acc, {logits:y_pred})
sess.run(acc_op, {logits:y_pred})

print("Tensorflow's Pearson: \t{}".format(sess.run(acc,{logits:y_pred})))

This prints:

Scipy's Pearson:        0.38060760498046875
Tensorflow's Pearson:   0.38060760498046875

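Differences between the SciPy and TensorFlow computations

The difference between the test scores you report is quite large. I looked at the source and found the following differences:

1. Update ops

The result of tf.contrib.metrics.streaming_pearson_correlation is not stateless. It returns the correlation coefficient op together with an update_op for newly arriving data. If you call the update op with different data before actually evaluating the coefficient op on y_pred, you get a completely different result: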

sess.run(tf.global_variables_initializer())

for _ in range(20):
    sess.run(acc_op, {logits: np.random.randn(*y_pred.shape)})

print("Tensorflow's Pearson: \t{}".format(sess.run(acc,{logits:y_pred})))

This prints:

Scipy's Pearson:        0.38060760498046875
Tensorflow's Pearson:   -0.0678008571267128
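To make the statefulness concrete without a TensorFlow session, here is a minimal sketch of a streaming metric. `StreamingPearson` is a hypothetical class invented for this illustration; it accumulates running sums in plain NumPy and is not how TensorFlow implements its metric. The point is only that every `update` call changes what `result` returns afterwards.

```python
import numpy as np

class StreamingPearson:
    """Hypothetical stateful metric: keeps running sums across update() calls."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxy = self.sxx = self.syy = 0.0

    def update(self, x, y):
        # Fold a new batch into the accumulated state.
        x = np.asarray(x, dtype=np.float64)
        y = np.asarray(y, dtype=np.float64)
        self.n += x.size
        self.sx += x.sum()
        self.sy += y.sum()
        self.sxy += (x * y).sum()
        self.sxx += (x * x).sum()
        self.syy += (y * y).sum()

    def result(self):
        # Pearson r over ALL data seen so far, not just the last batch.
        cov_xy = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx ** 2 / self.n
        var_y = self.syy - self.sy ** 2 / self.n
        return cov_xy / np.sqrt(var_x * var_y)

y_pred = [2, 2, 3, 4, 5, 5, 4, 2]
y_true = [1, 2, 3, 4, 5, 6, 7, 8]

# Fresh metric: one update with the data of interest.
m = StreamingPearson()
m.update(y_pred, y_true)
fresh = m.result()

# "Polluted" metric: an unrelated batch was fed in first.
m2 = StreamingPearson()
m2.update([8, 7, 6, 5, 4, 3, 2, 1], y_true)
m2.update(y_pred, y_true)
polluted = m2.result()

print("fresh:    {:.7f}".format(fresh))
print("polluted: {:.7f}".format(polluted))
```

The fresh metric reproduces the SciPy value for this data, while the polluted one even changes sign. When evaluating a network, this suggests re-initializing the metric's local variables before the evaluation pass so that earlier batches do not leak into the reported score.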

2. Different formulas

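SciPy:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

TensorFlow:

$$r = \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{cov}(x, x)\,\operatorname{cov}(y, y)}}$$

Although the two are mathematically equivalent, TensorFlow computes the correlation coefficient differently. It uses the sample covariances of (x, x), (x, y), and (y, y) to compute the correlation coefficient, which can introduce different rounding errors.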
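As a quick sanity check that the deviation-sum form and the sample-covariance form agree, the sketch below evaluates both in float64 with NumPy. It assumes `np.cov` with `ddof=1` as a stand-in for the sample covariance; it does not call SciPy or TensorFlow internals.

```python
import numpy as np

x = np.array([2, 2, 3, 4, 5, 5, 4, 2], dtype=np.float64)
y = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=np.float64)

# Form 1: sums of products of deviations from the mean.
dx, dy = x - x.mean(), y - y.mean()
r_direct = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Form 2: ratio of sample covariances (normalized by n - 1).
c = np.cov(x, y, ddof=1)  # 2x2 matrix: [[cov(x,x), cov(x,y)], [cov(y,x), cov(y,y)]]
r_cov = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])

print("direct form:     {:.15f}".format(r_direct))
print("covariance form: {:.15f}".format(r_cov))
print("cov(x, y):       {:.15f}".format(c[0, 1]))
```

For this data cov(x, y) equals 8.5 / 7, the same division used in the precision demonstration earlier in the answer. The normalization by n - 1 cancels between numerator and denominator, so any disagreement between the two forms comes from rounding in the intermediate steps, not from the mathematics.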