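- tf-nightly version = 2.12.0-dev20230203
- Python version = 3.10.6
- CUDA driver version = 525.85.12
- CUDA version = 12.0
- cuDNN version = 8.5.0
- I am using Linux (x86_64, Ubuntu 22.04)
- I am coding in Visual Studio Code inside a venv virtual environment
I am trying to run some models on the GPU (NVIDIA GeForce RTX 3050) using tensorflow nightly 2.12 (to be able to use CUDA 12.0). My problem is that every check I run appears to pass, but in the end the script is not able to detect the GPU. I have spent a lot of time trying to work out what is happening and nothing seems to help, so any advice or solution would be more than welcome. The GPU does seem to work with torch, as you can see at the very end of the question.
Here are some of the most common CUDA checks I ran (from the Visual Studio Code terminal); I hope you find them useful:
Check the CUDA version: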
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
Check that the CUDA libraries are linked correctly:
$ echo $LD_LIBRARY_PATH
/usr/cuda/lib
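Printing `LD_LIBRARY_PATH` only shows which directories are listed, not whether they exist or what they contain. A minimal sketch of a follow-up check, assuming the interesting libraries start with the prefixes in `WANTED` (that list and the helper name `scan_library_path` are illustrative assumptions, not part of any CUDA or TensorFlow tooling):

```python
import os
from pathlib import Path

# Assumed library name prefixes to look for; adjust to the libraries you care about.
WANTED = ("libcudart", "libcudnn", "libcublas")


def scan_library_path(ld_library_path, wanted=WANTED):
    """Return {directory: sorted matching file names} for every entry of an
    LD_LIBRARY_PATH-style string. Entries that are not directories map to None."""
    report = {}
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # skip empty segments such as a trailing ':'
        directory = Path(entry)
        if not directory.is_dir():
            report[entry] = None
            continue
        report[entry] = sorted(
            p.name for p in directory.iterdir() if p.name.startswith(wanted)
        )
    return report


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, libs in scan_library_path(current).items():
        if libs is None:
            print(f"{directory}: directory does not exist")
        else:
            print(f"{directory}: {libs or 'no matching libraries found'}")
```

Run inside the same venv and terminal as the failing script, this would show whether a listed directory such as `/usr/cuda/lib` actually exists and holds any of the expected files.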
Check the NVIDIA driver for the GPU and check that the GPU is visible from the venv:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
+-----------------------------------------------------------------------------+
Add the cuda/bin path and check it:
$ export PATH="/usr/local/cuda/bin:$PATH"
$ echo $PATH
/usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
Custom function to check that CUDA is installed correctly: [function by Sherlock]
function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
function check() { lib_installed $1 && echo "$1 is installed" || echo "ERROR: $1 is NOT installed"; }
check libcuda
check libcudart
libcudart.so.12 -> libcudart.so.12.0.146
libcuda.so.1 -> libcuda.so.525.85.12
libcuda.so.1 -> libcuda.so.525.85.12
libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
libcuda is installed
libcudart.so.12 -> libcudart.so.12.0.146
libcudart is installed
Custom function to check that cuDNN is installed correctly: [function by Sherlock]
function lib_installed() { /sbin/ldconfig -N -v $(sed 's/:/ /' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep $1; }
function check() { lib_installed $1 &&
So, once I had finished these checks, I ran a script to verify that everything was finally working, and the following error appeared:
import tensorflow as tf
print(f'\nTensorflow version = {tf.__version__}\n')
print(f'\n{tf.config.list_physical_devices("GPU")}\n')
2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]
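The warning says some GPU libraries could not be opened with dlopen, but at this log level it does not name them. One way to narrow this down is to try loading candidate libraries directly from the same Python interpreter. The following is a minimal sketch using only the standard `ctypes` module; the sonames in `CANDIDATES` are assumptions chosen for illustration, since the exact names and versions depend on the CUDA and cuDNN releases the installed TensorFlow wheel was built against.

```python
import ctypes

# Assumed sonames, for illustration only. The libraries TensorFlow really needs
# (and their version suffixes) depend on how the installed wheel was built.
CANDIDATES = ["libcuda.so.1", "libcudart.so.12", "libcublas.so.12", "libcudnn.so.8"]


def probe_libraries(names):
    """Try to load each shared library by name.

    Returns {name: None} when loading succeeds, or {name: error message}
    when the dynamic loader raises OSError.
    """
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in probe_libraries(CANDIDATES).items():
        status = "OK" if error is None else f"FAILED: {error}"
        print(f"{name}: {status}")
```

`tf.sysconfig.get_build_info()` reports the CUDA and cuDNN versions the wheel was compiled against (for GPU builds), which can guide which version suffixes to put in the candidate list.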
Extra check: I ran a similar check script with torch, and there it works, so I guess the problem is related to tensorflow/tf-nightly.
import torch
print(f'\nAvailable cuda = {torch.cuda.is_available()}')
print(f'\nGPUs availables = {torch.cuda.device_count()}')
print(f'\nCurrent device = {torch.cuda.current_device()}')
print(f'\nCurrent Device location = {torch.cuda.device(0)}')
print(f'\nName of the device = {torch.cuda.get_device_name(0)}')

Available cuda = True

GPUs availables = 1

Current device = 0

Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>

Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU
If you know anything that might help solve this issue, please don't hesitate to tell me.