Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Question

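I installed the GPU build of TensorFlow 1.0.1 on my MacBook Pro, which has a GeForce GT 750M, along with CUDA 8.0.71 and cuDNN 5.1. I'm running TensorFlow code that works fine with the CPU-only (non-GPU) build, but with the GPU build I get this error (occasionally it runs fine, too):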

name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 67.48MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 67.48M (70754304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Training...

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Abort trap: 6

What is happening here? Is this a bug in TensorFlow? Please help.

This is the GPU memory while the Python code is running:

Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 91.477 of 2047.6 MB (i.e. 4.47%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 36.121 of 2047.6 MB (i.e. 1.76%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 71.477 of 2047.6 MB (i.e. 3.49%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free

- Shimano

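Please post your Nvidia GPU utilization and memory numbers. My guess is that your GPU has run out of memory. - ruoho ruotsi

On Linux I use "nvidia-smi", but it doesn't exist on macOS. Try this instead: https://github.com/phvu/cuda-smi - ruoho ruotsi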

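At first it looked like the GPU was out of space, but I tried again after a reboot and found there was enough. Here is the terminal output (https://pastebin.com/9D2983ex). - Shimano

OK, if this is your issue (or close to it), hopefully the TensorFlow developers can offer some insight: https://github.com/tensorflow/tensorflow/issues/8879 - ruoho ruotsi

I have exactly the same setup (MBP w/750M GPU). I was able to resolve this error by downgrading the CUDA driver from 8.083 to 8.0.46. I'm running tensorflow-gpu 1.1.0 (tensorflow 1.0.0 is also installed, but the GPU version is the one being run). My setup also fails sometimes if I haven't freed memory on the GPU. - anon01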

23 Answers

- Farabi Ahmed Tarhan · Answer 1

I hit the same error and managed to fix it. My system looks like this:

OS: Ubuntu 14.04
GPU: GTX 1050Ti
Nvidia driver: 375.66
TensorFlow: 1.3.0
cuDNN: 6.0.21 (cudnn-8.0-linux-x64-v6.0.deb)
CUDA: 8.0.61
Keras: 2.0.8

This is how I solved it:

I copied cudnn files to appropriate locations (/usr/local/cuda/include and /usr/local/cuda/lib64)

I set the environment variables as:

* export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
* export CUDA_HOME=/usr/local/cuda

I also ran sudo ldconfig -v to refresh the shared-library cache used by the runtime linker.
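
As a rough way to double-check the steps above, the following Python sketch looks at whether the CUDA library directory is listed in LD_LIBRARY_PATH and asks the system for a cuDNN library. The helper names (dir_on_library_path, report) are made up for illustration, /usr/local/cuda/lib64 is simply the path used in this answer, and ctypes.util.find_library can return None on some systems even when the library loads fine, so treat the output as a hint rather than a verdict.

```python
import ctypes.util
import os


def dir_on_library_path(lib_dir, env=None):
    """Return True if lib_dir is one of the entries in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    wanted = os.path.normpath(lib_dir)
    # Skip empty entries; normalize so a trailing slash does not matter.
    return any(e and os.path.normpath(e) == wanted for e in entries)


def report(lib_dir="/usr/local/cuda/lib64"):
    """Print a short summary of the cuDNN-related environment."""
    print("CUDA_HOME:", os.environ.get("CUDA_HOME", "(not set)"))
    print("lib dir on LD_LIBRARY_PATH:", dir_on_library_path(lib_dir))
    # On Linux, find_library consults the linker cache among other lookups.
    print("cudnn found by find_library:", ctypes.util.find_library("cudnn"))
```

Calling report() in the same shell that launches training shows what that process will actually see.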

- singrium · Answer 2

I solved this by limiting GPU memory usage with the following code:

import tensorflow as tf

# Let this process use at most 70% of the GPU memory.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
tf.compat.v1.keras.backend.set_session(
    tf.compat.v1.Session(config=config))

This works for TensorFlow 2.
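
For readers on TensorFlow 2 who would rather not go through a compat.v1 session, a similar cap can be expressed as a hard memory limit on a logical device. This is only a sketch, assuming TensorFlow 2.4 or later (where tf.config.set_logical_device_configuration is available) and that it runs before the GPU is first used. fraction_to_limit_mb and cap_first_gpu are illustrative names, and a hard limit is similar to, but not the same thing as, per_process_gpu_memory_fraction.

```python
def fraction_to_limit_mb(total_mb, fraction):
    """Convert a memory fraction (0 < fraction <= 1) into a whole-MB limit."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_mb * fraction)


def cap_first_gpu(total_mb, fraction=0.7):
    """Limit TensorFlow to a fraction of the first GPU's memory.

    Must be called before TensorFlow initializes the GPU.
    Returns the limit in MB, or None if no GPU is visible.
    """
    import tensorflow as tf  # imported lazily; assumption: TF >= 2.4

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return None
    limit = fraction_to_limit_mb(total_mb, fraction)
    tf.config.set_logical_device_configuration(
        gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=limit)])
    return limit
```

For the 2 GiB card in the question this would be called as cap_first_gpu(total_mb=2048, fraction=0.7) at the very top of the script.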

- Ulrik Hørlyk Hjort · Answer 3

I had the same problem and solved it by adding the following:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
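
The variable needs to be in place before TensorFlow first touches the GPU, so the safest spot is ahead of the import. A minimal sketch of that ordering follows. enable_gpu_memory_growth is a hypothetical helper name, and the trailing comment about tf.config.experimental.set_memory_growth refers to the TensorFlow 2 API, an alternative that this answer does not mention.

```python
import os
import sys


def enable_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of grabbing it all.

    Returns True if the variable was set before TensorFlow was imported,
    False if TensorFlow was already loaded (the setting may then be too late).
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return not already_imported


if enable_gpu_memory_growth():
    print("memory growth requested; import tensorflow after this point")
else:
    print("tensorflow was imported first; consider setting the variable earlier")

# In TensorFlow 2 a per-device equivalent exists, for example:
#   for gpu in tf.config.list_physical_devices("GPU"):
#       tf.config.experimental.set_memory_growth(gpu, True)
```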

- JTIM · Answer 4

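Remember to close any TensorBoard terminal/cmd window, or any other terminal that is interacting with the directory. After that you can restart training and it should work.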

- Nwoye CID · Answer 5

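This is a cuDNN compatibility problem. Check the GPU-enabled software you installed, e.g. tensorflow-gpu. Which version is it? Is it compatible with your cuDNN version? And was your cuDNN built for the CUDA version you have installed?

I observed: cuDNN v7.0.3 for CUDA 7.*, cuDNN v7.1.2 for CUDA 9.0, cuDNN v7.3.1 for CUDA 9.1, and so on.

So also check that your TensorFlow version matches your CUDA configuration. For example, with tensorflow-gpu: TF v1.4 for cuDNN 7.0.*, TF v1.7 and above for CUDA 9.0.*, and so on.

So what you need to do is reinstall the matching cuDNN version. Hope it helps!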

- Josmar · Answer 6

I ran into the same problem (Ubuntu 18.04). I was using:

tensorflow 2.1
cuda 10.1
cudnn 7.6.5

I solved it by uninstalling CUDA, deleting its folders, and reinstalling with apt following the instructions on the TensorFlow page: https://www.tensorflow.org/install/gpu?hl=fr#ubuntu_1804_cuda_101

- Nwoye CID · Answer 7

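This has to do with the fraction of memory available for loading the GPU resources needed to create the cudnn handle, also known as per_process_gpu_memory_fraction. Reducing this memory fraction yourself will resolve the error.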

import tensorflow as tf  # TensorFlow 1.x API

# Cap this process at 70% of the GPU's memory.
sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

with tf.Session(config=sess_config) as sess:
    sess.run([whatever])

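Use as small a fraction as will fit in your memory. (The code uses 0.7; you can start at 0.3 or even lower and increase it until you get the same error, which is your limit.) Pass it as the config to tf.Session(), tf.train.MonitoredTrainingSession(), or the Supervisor's sv.managed_session().

This leaves your GPU room to create a cudnn handle for your TensorFlow code.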
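
The "start low and increase until it fails" procedure can be automated. The sketch below is only an outline: largest_working_fraction is an illustrative name, and the works callback is an assumption standing in for whatever launches your training with a given per_process_gpu_memory_fraction. Because a failed cudnn handle aborts the whole process (see the "Abort trap" and "Aborted" lines in the logs on this page), a real works callback would most likely have to run each trial in a separate process and inspect its exit code.

```python
def largest_working_fraction(works, start=0.3, stop=0.9, step=0.1):
    """Return the largest fraction in [start, stop] for which works(f) is True.

    Tries fractions in increasing order and stops at the first failure.
    Returns None if even the starting fraction fails.
    """
    best = None
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        # Round to avoid float noise such as 0.7000000000000001.
        fraction = round(start + i * step, 2)
        if not works(fraction):
            break
        best = fraction
    return best
```

With a stand-in predicate such as lambda f: f <= 0.7 the search stops at the first fraction above 0.7 and reports the last one that passed.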

- Abhay Jeet Singh · Answer 8

I ran into the same problem:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493 pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed:  stream->parent()->GetConvolveAlgorithms(&algorithms)

Aborted (core dumped)

But in my case, running with sudo works perfectly fine.

- xtluo · Answer 9

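I hit the same problem because my GPU had run out of memory due to some background zombie/terminated processes. Killing those processes worked for me: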

ps -eo pid,stat,command | awk '$2 ~ /^Z/'  # state Z: zombie (defunct)
ps -eo pid,stat,command | awk '$2 ~ /^T/'  # state T: stopped (suspended)
kill -9 your_zombie_or_terminated_process_id
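
The same check can be done from Python by filtering on the process state column. The sketch below assumes a ps that accepts -eo pid,stat,comm; parse_ps and find_stuck are illustrative names. Note that in ps terms Z means zombie and T means stopped, and that a zombie has already exited, so it is usually its parent (or a stopped process still holding GPU memory) that has to be dealt with.

```python
import subprocess


def parse_ps(output, states=("Z", "T")):
    """Return (pid, stat, command) tuples whose state starts with one of states."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[1].startswith(tuple(states)):
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def find_stuck():
    """Run ps and return zombie or stopped processes."""
    out = subprocess.run(["ps", "-eo", "pid,stat,comm"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)
```

Iterating over find_stuck() and printing each tuple gives a short list of process IDs to inspect before deciding what to kill.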

- Vishal Reddy · Answer 10

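In my case I have two GPUs, and GPU 0 was busy training another model. I explicitly selected GPU 1: os.environ["CUDA_VISIBLE_DEVICES"]="1"

My mistake was running that line after creating the model and before training it.

I fixed it by moving the line to the top, i.e. right after the library imports.

The point is that once the model has settled on the GPUs it can use (all available GPUs, if you don't say otherwise), it ignores any later code that restricts it to a single GPU.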
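
Put differently, the device selection has to happen before the framework initializes CUDA. A small sketch of that ordering is shown here. select_gpu is a hypothetical helper, setting CUDA_DEVICE_ORDER is an optional extra that this answer does not use, and build_model in the comment is a placeholder.

```python
import os


def select_gpu(index):
    """Expose only the given GPU to CUDA; call before any model is created."""
    # Optional: order devices by PCI bus ID so indices match nvidia-smi.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(1)  # GPU 0 is busy, so use GPU 1 as in the answer

# Only after this point build the model, for example:
#   import tensorflow as tf
#   model = build_model()
```

Inside the process the selected card then shows up as the only device, numbered 0.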