Non-deterministic behavior of TensorFlow while_loop()

Question

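I use TensorFlow's while_loop to process large matrices, but I recently noticed strange behavior: I get different results on every run, and sometimes even nan values. I spent some time narrowing the problem down and now have the following minimal example: I take a large 15000x15000 matrix K filled with ones and compute K⁵u for a vector u filled with ones. After one iteration, I expect the result to be a vector filled with 15000. But that is not what happens.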

import numpy as np
import tensorflow as tf

n = 15000
np_kernel_mat = np.ones((n, n), dtype=np.float32)
kernel_mat = tf.constant(np_kernel_mat)

# for debugging
def compare_kernel(kernel_matrix):
    print("AverageDifference:" + str(np.average(np.abs(np_kernel_mat - kernel_matrix))))
    print("AmountDifferent:" + str(np.count_nonzero(np.abs(np_kernel_mat - kernel_matrix))))
    return True

# body of the loop
def iterate(i, u):
    # for debugging
    with tf.control_dependencies(tf.py_func(compare_kernel, [kernel_mat], [tf.bool])):
        u = tf.identity(u)
    # multiply
    u = tf.matmul(kernel_mat, u)
    # check result and kernel 
    u = tf.Print(u, [tf.count_nonzero(tf.abs(kernel_mat-np_kernel_mat))], "AmountDifferentKernel: ")
    u = tf.Print(u, [tf.count_nonzero(tf.abs(u-float(n)))], "AmountDifferentRes: ")
    i = i + 1
    return i, u


def cond(i, u):
    return tf.less(i, 5)

u0 = tf.fill((n, 1), 1.0, name='u0')
iu_0 = (tf.constant(0), u0)
iu_final = tf.while_loop(cond, iterate, iu_0, back_prop=False, parallel_iterations=1)
u_res = iu_final[1]


with tf.Session() as sess:
    kernel_mat_eval, u_res_eval = sess.run([kernel_mat, u_res])
    print(np.array_equal(kernel_mat_eval, np_kernel_mat))

Running this now produces the following output:

I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:00:0f.0
totalMemory: 11.93GiB freeMemory: 11.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:00:0f.0, compute capability: 5.2)
minimal_example.py:25: RuntimeWarning: invalid value encountered in subtract
  print("AverageDifference:" + str(np.average(np.abs(np_kernel_mat - kernel_matrix))))
/usr/local/lib/python3.6/dist-packages/numpy/core/_methods.py:70: RuntimeWarning: overflow encountered in reduce
  ret = umr_sum(arr, axis, dtype, out, keepdims)
AverageDifference:nan
minimal_example.py:26: RuntimeWarning: invalid value encountered in subtract
  print("AmountDifferent:" + str(np.count_nonzero(np.abs(np_kernel_mat - kernel_matrix))))
AmountDifferent:4096
AmountDifferentKernel: [0]
AmountDifferentRes, DifferenceRes: [4][inf]
AverageDifference:nan
AmountDifferent:4096
AmountDifferentKernel: [0]
AmountDifferentRes, DifferenceRes: [15000][nan]
AverageDifference:nan
AmountDifferent:4096
AmountDifferentKernel: [0]
AmountDifferentRes, DifferenceRes: [15000][nan]
AverageDifference:nan
...

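In the second iteration the result is no longer 15000, but that does not explain why the difference is NaN. On the CPU everything works fine (the differences are around 2e08). My questions now are: Why does the output of the Print op differ from what py_func prints? Why is the evaluated matrix equal to the original matrix again? Why do I get different results on different runs? Can anyone reproduce this? I am running Ubuntu 16.04, TensorFlow 1.8, numpy 1.14, python3.6. The GPU is a GeForceGTX 1080.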

NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.48  Thu Mar 22 00:42:57 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)

- Lia Fiona

Just passing by to say that this is a very well-asked question. Thank you for taking the time to write it up. - Jorge Leitao

I can't reproduce this with TF 1.8 either (same environment). - P-Gn

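I have updated to TF 1.10 and the problem seems to have gone away. However, for large matrices the results still deviate substantially from the expected ones (I changed the example to compare the result against the actual expected value of n^(i+1)). - Lia Fiona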

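@LiaFiona, could you elaborate on what the current problem is and what you expect to get? I ran the snippet with TF 1.10 on both CPU and GPU and got the same results in both cases (I'm using Windows, Py 3.6 and a Titan V). - jdehesa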

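OK, I can see the behavior now, and it occurs on both GPU and CPU. This is clearly a precision issue; switching to float64 greatly reduces the error. Note that the numbers you compute in the last iteration are around 7.6×10²⁰, so an error of 1.3×10¹⁵ is "relatively small" (float32 is typically accurate to about 7 significant decimal digits, though I guess the error accumulates over the iterations). - jdehesa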
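To put the magnitudes from the comment above in perspective, here is a minimal NumPy sketch (not part of the original thread). It assumes the final iterate should equal n^5 exactly and takes the 1.3e15 error figure from the comment as given. It prints the gap between adjacent representable values near that result in float32 and in float64, and the relative size of the reported error.

```python
import numpy as np

n = 15000
expected = float(n ** 5)  # value of each entry after the fifth multiplication, about 7.6e20

# Gap between adjacent representable values near the expected result
ulp32 = float(np.spacing(np.float32(expected)))
ulp64 = float(np.spacing(np.float64(expected)))

reported_error = 1.3e15  # error magnitude quoted in the comment (taken as given)
relative_error = reported_error / expected
error_in_ulps32 = reported_error / ulp32

print("expected value:         %.4e" % expected)
print("float32 spacing:        %.4e" % ulp32)
print("float64 spacing:        %.4e" % ulp64)
print("relative error:         %.2e" % relative_error)
print("error in float32 steps: %.1f" % error_in_ulps32)
```

Near 7.6e20 a single float32 step is already on the order of 1e13 to 1e14, so an absolute error of 1.3e15 amounts to only a handful of rounding steps. The same value in float64 has a spacing many orders of magnitude smaller, which is consistent with the observation that switching to float64 greatly reduces the error.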

1 Answer

- Furkan Toprak · Accepted Answer

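Most likely your problem stems from seeding. Make sure you set a seed with both random.seed() and numpy.random.seed(). You need to set both, because numpy's random seed is independent of the random module's state.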
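A minimal sketch of what this answer suggests, assuming the script draws random numbers from both Python's random module and NumPy. The SEED value and the set_seeds helper are hypothetical names introduced only for illustration.

```python
import random

import numpy as np

SEED = 42  # hypothetical seed value; any fixed integer works


def set_seeds(seed):
    """Seed both generators; they keep separate, independent state."""
    random.seed(seed)
    np.random.seed(seed)


# Draw one number from each generator, reseed, and draw again.
set_seeds(SEED)
first = (random.random(), np.random.rand())

set_seeds(SEED)
second = (random.random(), np.random.rand())

print("draws repeat after reseeding:", first == second)
```

Seeding only one of the two generators would leave the other one producing different values on each run, which is the independence the answer refers to.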