Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Question


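I installed the GPU version of tensorflow 1.0.1 on my MacBook Pro, which has a GeForce GT 750M. CUDA 8.0.71 and cuDNN 5.1 are installed as well. I am running TF code that works fine with the CPU-only build of tensorflow, but with the GPU version I get this error (although it occasionally works):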

name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 67.48MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 67.48M (70754304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Training...

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Abort trap: 6

What is happening here? Is this a bug in tensorflow? Please help.

This is the GPU memory while I run the Python code:

Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 91.477 of 2047.6 MB (i.e. 4.47%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 36.121 of 2047.6 MB (i.e. 1.76%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 71.477 of 2047.6 MB (i.e. 3.49%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free

- Shimano

Please post your Nvidia GPU utilization and memory figures. My guess is that your GPU has run out of memory. - ruoho ruotsi

On Linux I use "nvidia-smi", but it does not exist on macOS. Try this instead: https://github.com/phvu/cuda-smi - ruoho ruotsi
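To compare readings like the ones posted in the question, the cuda-smi lines can be parsed with a few lines of Python. This is only a minimal sketch: the regular expression is written against the output format shown in the question, and the names `parse_cuda_smi`, `low_memory`, and the 256 MB threshold are hypothetical choices for illustration, not part of cuda-smi or TensorFlow.

```python
import re

# Matches lines in the format shown in the question, for example:
# Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
LINE_RE = re.compile(
    r"Device (?P<index>\d+) \[[^\]]*\]: (?P<name>.+?) \(CC [\d.]+\): "
    r"(?P<free>[\d.]+) of (?P<total>[\d.]+) MB"
)


def parse_cuda_smi(text):
    """Return a list of (device index, name, free MB, total MB) tuples."""
    readings = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:  # shell prompt lines and blank lines are skipped
            readings.append((
                int(match.group("index")),
                match.group("name"),
                float(match.group("free")),
                float(match.group("total")),
            ))
    return readings


def low_memory(readings, min_free_mb=256.0):
    """Return the readings whose free memory is below min_free_mb (an assumed threshold)."""
    return [reading for reading in readings if reading[2] < min_free_mb]


# Two readings copied from the question.
sample = """Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free"""

for index, name, free_mb, total_mb in low_memory(parse_cuda_smi(sample)):
    print(f"device {index} ({name}): only {free_mb} MB of {total_mb} MB free")
```

Both sample readings fall under the assumed threshold, which is consistent with the guess that the GPU has very little free memory while the script runs.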


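At first it looked like the GPU was out of memory, but I tried again after rebooting and there was enough free. Here is the terminal output. (https://pastebin.com/9D2983ex) - Shimano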

OK, in case this issue is yours (or matches your problem), hopefully the TensorFlow developers can offer some insight: https://github.com/tensorflow/tensorflow/issues/8879 - ruoho ruotsi

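I have exactly the same setup (MBP w/ 750M GPU). I was able to resolve this error by downgrading the CUDA driver from 8.0.83 to 8.0.46. I am running tensorflow-gpu 1.1.0 (tensorflow 1.0.0 is also installed, but the GPU version is the one that runs). My setup also fails sometimes if I have not freed memory on the GPU. - anon01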


23 Answers


- jrounds · Answer 1

I ran into this problem when, on a system with CUDA 9.0 installed, I accidentally installed the CUDA 9.2 package libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb instead of libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb.

This happened because I had previously installed CUDA 9.2 and then downgraded to CUDA 9.0; evidently libcudnn is tied to the CUDA version.
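A mismatch of this kind can be spotted from the package name alone, because the libcudnn packages named above carry a `+cudaX.Y` suffix. The following is a rough sketch, assuming that suffix convention holds for the package in question; the helper names and the hard-coded `installed_cuda` value are hypothetical and would need to be replaced with the CUDA version actually installed.

```python
import re


def cuda_version_of_package(version):
    """Extract 'X.Y' from a string containing '+cudaX.Y', or return None."""
    match = re.search(r"\+cuda(\d+\.\d+)", version)
    return match.group(1) if match else None


def matches_installed_cuda(package_version, installed_cuda):
    """True if the package was built for the given CUDA version."""
    return cuda_version_of_package(package_version) == installed_cuda


# Assumption: the system has CUDA 9.0, as in the answer above.
installed_cuda = "9.0"

for name in (
    "libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb",
    "libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb",
):
    status = "ok" if matches_installed_cuda(name, installed_cuda) else "MISMATCH"
    print(f"{name}: {status}")
```

With CUDA 9.0 assumed, the first package is reported as a mismatch and the second as ok.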

- Anjaneyalu T · Answer 2

Rebooting the machine worked for me. Try this:

sudo reboot

Then rerun the code.

- Francesco Pasa · Answer 3

For me, rerunning the CUDA installation as described here solved the problem:

# Add NVIDIA package repository
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt install ./cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt update

# Install CUDA and tools. Include optional NCCL 2.x
sudo apt install cuda9.0 cuda-cublas-9-0 cuda-cufft-9-0 cuda-curand-9-0 \
    cuda-cusolver-9-0 cuda-cusparse-9-0 libcudnn7=7.2.1.38-1+cuda9.0 \
    libnccl2=2.2.13-1+cuda9.0 cuda-command-line-tools-9-0

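During the installation, apt-get downgraded cudnn7, which I think may have been the culprit. It had probably been updated by accident, through apt-get upgrade, to a version incompatible with some other part of the system.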