TensorFlow 2.0 cannot use the GPU, is something wrong with cuDNN?: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize.

5

I am trying to understand and debug my code. I tried to run prediction on the GPU with a CNN model developed with tf2.0 / tf.keras, but I get the error message below. Can someone help me fix it?

Here is my environment configuration:

environment:
python 3.6.8
tensorflow-gpu 2.0.0-rc0
nvidia 418.x
CUDA 10.0
cuDNN 7.6+

And here is the log file:

2019-09-28 13:10:59.833892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-28 13:11:00.228025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-28 13:11:00.957534: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963310: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963416: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node mobilenetv2_1.00_192/Conv1/Conv2D}}]]
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0=====>GPU Available:  True
=====> 4 Physical GPUs, 1 Logical GPUs

mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
  File "NSFW_Server.py", line 162, in <module>
    model.predict(initial_tensor)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 915, in predict
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 722, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 393, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3625, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node mobilenetv2_1.00_192/Conv1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_10727]

Function call stack:
keras_scratch_graph

Code:
import numpy as np
import tensorflow as tf

if __name__ == "__main__":

    print("=====>GPU Available: ", tf.test.is_gpu_available())
    tf.debugging.set_log_device_placement(True)

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs

            tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
            tf.config.experimental.set_memory_growth(gpus[0], True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print("=====>", len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)

    paras_path = "./paras/{}".format(int(2011))
    model = tf.keras.experimental.load_from_saved_model(paras_path)
    initial_tensor = np.zeros((1, INPUT_SHAPE, INPUT_SHAPE, 3))
    model.predict(initial_tensor)
4 Answers

13
You need to check that you have the correct versions of CUDA + cuDNN + TensorFlow (and also make sure everything is actually installed).
Below are some example working configurations (updated for the latest versions of TensorFlow).
  1. Cuda 11.3.1 + CuDNN 8.2.1.32 + TensorFlow 2.7.0

  2. Cuda 11.0 + CuDNN 8.0.4 + TensorFlow 2.4.0

  3. Cuda 10.1 + CuDNN 7.6.5 (usually > 7.6) + TensorFlow 2.2.0/TensorFlow 2.3.0 (TF >= 2.1 requires CUDA >= 10.1)

  4. Cuda 10.1 + CuDNN 7.6.5 (usually > 7.6) + TensorFlow 2.1.0 (TF >= 2.1 requires CUDA >= 10.1)

  5. Cuda 10.0 + CuDNN 7.6.3 + TensorFlow 1.13/1.14/TensorFlow 2.0

  6. Cuda 9.0 + CuDNN 7.0.5 + TensorFlow 1.10

Usually, this error appears when the TensorFlow and cuDNN versions you have installed are incompatible. In my case it appeared when I tried to use an older TensorFlow with a newer cuDNN version.
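One way to see which CUDA/cuDNN versions your installed TensorFlow wheel was actually built against is sketched below (note that tf.sysconfig.get_build_info() only exists in TF >= 2.3; older builds have to be compared against the tested-configurations table instead):

import tensorflow as tf

print("TensorFlow:", tf.__version__)

# get_build_info() reports the CUDA/cuDNN versions this wheel was compiled against;
# these are what must match the libraries installed on the system (TF >= 2.3 only).
if hasattr(tf.sysconfig, "get_build_info"):
    info = tf.sysconfig.get_build_info()
    print("built against CUDA :", info.get("cuda_version"))
    print("built against cuDNN:", info.get("cudnn_version"))
else:
    print("TF < 2.3: compare your versions against the tested configurations table instead")

# The GPU can only be used if it shows up here.
print("visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))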
If for some reason you get an error message like the following (and nothing else happens afterwards):
relying on driver to perform ptx compilation
Solution: install the latest NVIDIA driver.
[Seems to be resolved in TF >= 2.5.0] (see below):
For Windows users only: some recent combinations of CUDA, cuDNN and TF may not work because of a bug (an incorrectly named .dll). For this particular case, see this link: Tensorflow GPU Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found

Thanks. To make sure CUDA/cuDNN/TF are the correct versions, I pulled an image from Docker Hub, namely "tensorflow/tensorflow:2.0.0rc0-gpu-py3", and ran my code inside the container... but it still does not work and gives the same error message. - VinSent TeZla
Try installing them manually, then check again against the dependencies installed in the Docker image. You may have missed some tiny difference. - Timbus Calin
Thanks, has anyone tried any of the versions above? For example cuda 10.1 + CuDNN 7.64. - Profstyle
I have edited the answer to make it clearer. Cuda 10.0, not just 10, because there is a difference between 10.0 and 10.1. - Timbus Calin
@Profstyle, I have updated the answer with the latest TensorFlow versions. - Timbus Calin

0

For those getting the above error (on the Windows platform), I solved it by installing the cuDNN version that is compatible with the CUDA version already installed on the system.

    • To check the CUDA version, run nvcc --version.
    • Once you have downloaded the appropriate cuDNN version, extract the folder from the zip file.
    • Go to the bin folder of the extracted folder. Copy cudnn64_7.dll and paste it into CUDA's bin folder. In my case CUDA is installed at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin
    • This should fix the problem (a quick way to verify the DLL is actually visible to the loader is sketched below).
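If you want to confirm that the copied DLL can actually be found before re-running TensorFlow, a minimal check along these lines can help (a sketch, assuming cuDNN 7.x for CUDA 10.0, i.e. the file is named cudnn64_7.dll):

import ctypes

# Windows-only sanity check: loading succeeds only if the DLL sits in a folder the
# loader searches, e.g. the CUDA v10.0 bin folder, which is normally on PATH.
try:
    ctypes.WinDLL("cudnn64_7.dll")
    print("cudnn64_7.dll found and loadable")
except OSError as err:
    print("cudnn64_7.dll not found on PATH:", err)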

My system details:

  1. Windows 10
  2. CUDA 10.0
  3. TensorFlow 2.0
  4. GPU: Nvidia GTX 1060

I also found this blog post on installing TensorFlow with CUDA and GPU support on Windows 10 very helpful.


0
With cuda 10.1, I previously had cudnn 8.0.5; changing to cudnn 7.6 solved the problem.

-1

Check this TensorFlow GPU guide page for the instructions for your operating system. It solved the problem for me on Ubuntu 16.04.6 LTS with TensorFlow 2.0.
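After following the guide, a minimal smoke test like the one below (a sketch, not taken from the guide itself) can confirm that cuDNN initializes and a convolution actually runs on the GPU:

import tensorflow as tf

# If this convolution completes without the "Failed to get convolution algorithm"
# error, the CUDA/cuDNN setup is working for this TensorFlow build.
print("GPUs:", tf.config.experimental.list_physical_devices("GPU"))
x = tf.random.normal([1, 192, 192, 3])               # dummy image batch
conv = tf.keras.layers.Conv2D(8, 3, padding="same")
with tf.device("/GPU:0"):                            # soft placement falls back to CPU if no GPU is visible
    y = conv(x)
print("conv output shape:", y.shape)                 # expected: (1, 192, 192, 8)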

