tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Question

Tags: python-3.x, gpu, tensorflow2.0, nvidia

11

I am trying to use the GPU with TensorFlow. My TensorFlow version is 2.4.1 and I am using CUDA 11.2. Here is the output of nvidia-smi.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX110       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    N/A /  N/A |    254MiB /  2004MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1151      G   /usr/lib/xorg/Xorg                 37MiB |
|    0   N/A  N/A      1654      G   /usr/lib/xorg/Xorg                136MiB |
|    0   N/A  N/A      1830      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      5443      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      5659      G   /usr/lib/firefox/firefox            0MiB |
+-----------------------------------------------------------------------------+

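I am running into a strange problem. Earlier, when I listed all physical devices with tf.config.list_physical_devices(), it found one CPU and one GPU. I then tried a simple matrix multiplication on the GPU, which failed with an error along the lines of: failed to synchronize cuda stream CUDA_LAUNCH_ERROR (I did not record the exact message). When I repeated the same steps in another terminal, no GPU was detected at all. This time, listing the physical devices gave: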

>>> tf.config.list_physical_devices()
2021-04-11 18:56:47.504776: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
2021-04-11 18:56:47.534404: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.39.0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
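The line that matters in the log above is the single one with severity `E`; the rest are informational. A minimal sketch for filtering such a log, assuming the line format shown above (timestamp, one-letter severity, source file and line number, message). `parse_tf_log_line` and `errors_only` are hypothetical helpers written for illustration, not part of TensorFlow:

```python
import re

# Assumed line format, taken from the log in the question:
# 2021-04-11 18:56:47.534189: E tensorflow/.../cuda_driver.cc:328] failed call to cuInit: ...
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) "
    r"(?P<source>\S+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def parse_tf_log_line(text):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(text.strip())
    return match.groupdict() if match else None


def errors_only(log_text):
    """Keep only the lines logged with severity E (error) or F (fatal)."""
    parsed = (parse_tf_log_line(line) for line in log_text.splitlines())
    return [entry for entry in parsed if entry and entry["severity"] in ("E", "F")]


# Three lines copied from the question's output.
sample = """\
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
"""

if __name__ == "__main__":
    for entry in errors_only(sample):
        print(entry["source"], entry["line"], entry["message"])
```

Here `libcuda.so.1` loads successfully and the failure happens afterwards, inside `cuInit`, so the driver library is present but cannot be initialized.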

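My operating system is Ubuntu 20.04 with Python 3.8.5. As mentioned above, TensorFlow is 2.4.1 and CUDA is 11.2. I installed CUDA by following these instructions. One more piece of information: when I import tensorflow, it prints the following: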

import tensorflow as tf
2021-04-11 18:56:07.716683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

What am I missing? Why is the GPU no longer recognized, even though it was detected before?

- Ricky

These are the required versions: https://www.tensorflow.org/install/source#gpu - papaya

@papaya, is my configuration incorrect? I believe I am using the versions listed at that link. - Ricky

Install CUDA Toolkit 11.0, run sudo apt-get install nvidia-modprobe, and then reboot. Thanks. - user11530462

1

I have tensorflow 2.5 and cuda 11.0 but get the same error, "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error". What am I missing? - fisakhan

@Ricky, your nvidia-smi shows CUDA version 11.2, while importing tensorflow shows libcudart.so.11.0. Why do these versions differ? According to https://www.tensorflow.org/install/source#gpu, the TensorFlow and CUDA versions should be compatible. - fisakhan
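On the version question: the two numbers come from different places. The "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, while libcudart.so.11.0 is the name of the runtime library that TensorFlow opened, so they do not have to be equal. A rough sketch that pulls both numbers out of the text shown in the question; the helper names are hypothetical, and the rule "driver-supported version must be at least the runtime version" is a simplifying assumption rather than a full compatibility check:

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def runtime_cuda_version(tf_log_text):
    """Extract (major, minor) from a 'libcudart.so.X.Y' name in a TensorFlow log."""
    match = re.search(r"libcudart\.so\.(\d+)\.(\d+)", tf_log_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_covers_runtime(driver, runtime):
    """Simplified rule (an assumption): the driver must support at least the runtime version."""
    return driver is not None and runtime is not None and driver >= runtime


# Text copied from the question.
smi_header = "| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |"
import_log = "Successfully opened dynamic library libcudart.so.11.0"

if __name__ == "__main__":
    driver = driver_cuda_version(smi_header)
    runtime = runtime_cuda_version(import_log)
    print("driver supports up to:", driver)
    print("runtime library:", runtime)
    print("driver covers runtime:", driver_covers_runtime(driver, runtime))
```

Under that assumption, the pair reported in the question (11.2 and 11.0) is not a mismatch in itself.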

3 Answers

3

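I registered an account just to say that @Nate's answer worked for me. My setup is exactly the same as yours, and I had been trying for two days. What I ended up doing: reboot, press F10 to enter setup, go to Security, find the BIOS Secure Boot option (or something similar, I don't remember exactly), and disable it. A few extra confirmation steps followed, but it worked fine. I did not reinstall Ubuntu, since that felt a bit risky for me technically. Then I ran the tf.config line and got the following: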

2021-06-14 17:12:19.546509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

2021-06-14 17:12:26.754680: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2021-06-14 17:12:26.909679: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3593460000 Hz

2021-06-14 17:12:26.910016: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a8352501c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

2021-06-14 17:12:26.910040: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

2021-06-14 17:12:26.972350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1

2021-06-14 17:12:27.074861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

2021-06-14 17:12:27.075289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:0c:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.665GHz coreCount: 14 deviceMemorySize: 3.81GiB deviceMemoryBandwidth: 119.24GiB/s

More red lines followed the device properties, but I did not capture them.

Default GPU Device: /device:GPU:0

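I don't know why it works, but it does. Just change the Secure Boot setting.

I don't have enough reputation to upvote Nate's answer. I will come back later. But he/she really did provide a great solution.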

- Yuuko Hsueh

3

Disabling Secure Boot fixed the problem immediately. No need to reinstall anything.

> import tensorflow as tf
> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
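
To see whether Secure Boot is currently on before rebooting into the BIOS, the state can be read from the running system. A small sketch, assuming the `mokutil` tool is installed (it is available as a package on Ubuntu) and that `mokutil --sb-state` prints "SecureBoot enabled" or "SecureBoot disabled"; the function names are hypothetical:

```python
import shutil
import subprocess


def parse_sb_state(output):
    """Map the text printed by `mokutil --sb-state` to True, False, or None (unknown)."""
    text = output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None


def secure_boot_enabled():
    """Return the Secure Boot state, or None if mokutil is missing or gives no answer."""
    if shutil.which("mokutil") is None:
        return None
    result = subprocess.run(
        ["mokutil", "--sb-state"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some messages go to stderr (for example on systems without EFI variables),
    # so both streams are inspected.
    return parse_sb_state(result.stdout + result.stderr)


if __name__ == "__main__":
    state = secure_boot_enabled()
    if state is None:
        print("Secure Boot state could not be determined")
    else:
        print("Secure Boot is", "enabled" if state else "disabled")
```

A result of `None` only means the state is unknown on this machine; it does not say anything about the GPU driver.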

- Simeon Tsvetanov

- Nate Frisch · Accepted Answer

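I had exactly the same error and spent a lot of time trying to work out whether something had gone wrong while installing the TensorFlow-related software. After many hours of troubleshooting, I realized that I had never disabled "Secure Boot" in my BIOS when I set up Ubuntu 20.04, and this was causing problems with my NVIDIA driver. Here is what I recommend (I chose to run TensorFlow with Docker, which avoids installing all the CUDA-related pieces). Hope it works for you!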

Disable "Secure Boot" in your BIOS
Do a fresh install of Ubuntu 20.04
Install Docker following the instructions on the nvidia-container-toolkit page.

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

Install nvidia-container-toolkit from the same page.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

sudo apt-get install -y nvidia-docker2

sudo systemctl restart docker

Test to make sure it is working

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

Finally, run TensorFlow using the GPU-enabled Docker image!

docker run --gpus all -u $(id -u):$(id -g) -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --ip=0.0.0.0
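
The nvidia-smi check from the test step above can also be scripted, on the host or inside the container, instead of reading the table by eye. A minimal sketch using nvidia-smi's CSV query mode (`--query-gpu` and `--format=csv,noheader`); the helper names and the choice of fields are assumptions made for illustration:

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; chosen for illustration.
FIELDS = ("name", "driver_version", "memory.total")


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header) into a list of dicts, one per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def query_gpus():
    """Return the GPUs visible to nvidia-smi, or an empty list if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU visible to nvidia-smi")
    for gpu in gpus:
        print(gpu)
```

An empty list inside the container while the host reports a GPU would point at the container setup rather than at TensorFlow.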