NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

Question

56

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I would like to observe GPU utilization while training my TensorFlow models, but I get an error when I try to run "nvidia-smi".

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
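
In the `lspci -k` output above, the Cirrus device reports `Kernel driver in use: cirrus`, while the NVIDIA device at 00:03.0 reports no kernel driver at all, which is consistent with the nvidia-smi error. A minimal sketch of a helper that makes this check explicit follows; `nvidia_bound_driver` is a hypothetical name, not part of any NVIDIA tool, and it only parses text piped to it.

```shell
# Reads `lspci -k` output on stdin and prints the kernel driver bound to the
# first NVIDIA device, or "none" if that device has no driver bound.
# nvidia_bound_driver is a hypothetical helper name used for illustration.
nvidia_bound_driver() {
    awk '
        # Device entries start at column 0 with a hex bus address.
        /^[0-9a-fA-F]/ { in_nvidia = ($0 ~ /NVIDIA/) }
        in_nvidia && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use:[ \t]*/, "")
            print
            found = 1
            exit
        }
        END { if (!found) print "none" }
    '
}
```

Usage: `lspci -k | nvidia_bound_driver`. On a working installation this would be expected to print the driver name rather than `none`.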

I installed CUDA 7 and cuDNN using the following steps:

$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot

=======================================================================

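After rebooting, update the initramfs by running "$ sudo update-initramfs -u".

Now edit the /etc/modprobe.d/blacklist.conf file to blacklist the nouveau driver. Open the file in an editor and insert the following lines at the end of the file:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit the file.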

Now install the build-essential tools, then update the initramfs and reboot again as follows:

$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot
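
Once the machine is back up, one way to confirm that the blacklist took effect is to check that nouveau is not among the loaded modules. The sketch below assumes the standard `lsmod` column layout (module name in the first column); `nouveau_loaded` is a hypothetical helper name.

```shell
# Reads `lsmod` output on stdin.
# Exits 0 if a module named "nouveau" is listed, 1 otherwise.
# nouveau_loaded is a hypothetical helper name used for illustration.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

Usage: `if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi`.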

After the reboot, run the following commands to install NVIDIA:

$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot

Now that the system has come up, verify the installation by running the following:

$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head

You should see output similar to "nvidia.png".

Now run the following commands:

$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

However, "nvidia-smi" still shows no GPU activity while TensorFlow is training a model:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
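
For watching activity during training, nvidia-smi can also emit machine-readable output through `--query-gpu` and `--format=csv,noheader,nounits`. The sketch below classifies each GPU from that output. The helper name `gpu_activity` and the 100 MiB memory threshold are illustrative assumptions, not NVIDIA recommendations; memory is included because a process can hold GPU memory while the sampled utilization reads 0%.

```shell
# Reads lines of the form "<utilization %>, <memory used MiB>" on stdin,
# one line per GPU, as printed by:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits
# Prints "GPU <index>: busy" or "GPU <index>: idle" for each line.
# gpu_activity and the 100 MiB threshold are assumptions for illustration.
gpu_activity() {
    awk -F', *' '{
        state = ($1 + 0 > 0 || $2 + 0 > 100) ? "busy" : "idle"
        printf "GPU %d: %s\n", NR - 1, state
    }'
}
```

Usage: `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits | gpu_activity`. Adding `-l 1` to the nvidia-smi command repeats the query every second.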

- dbl001

1

What worked for me was running nvidia-settings and then selecting the NVIDIA GPU (Performance or On-Demand, depending on your preference). It had previously been set to Intel. - Akhil

27 Answers

Content provided by Stack Overflow.

- zfireear · Answer 1

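This error can occur after a Linux kernel update. You can fix it by rebuilding the NVIDIA driver with the following commands:

First, install dkms, which automatically regenerates kernel modules after the kernel version changes.
sudo apt-get install dkms
Second, rebuild your NVIDIA driver. My driver version here is 440.82; if the driver was installed previously, you can check the installed version in /usr/src.
sudo dkms build -m nvidia -v 440.82
Finally, reinstall the NVIDIA driver, then reboot your computer.
sudo dkms install -m nvidia -v 440.82

Now you can check that it works by running sudo nvidia-smi.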
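
To see whether a rebuild is needed at all, `dkms status` lists the kernels each module is installed for. A minimal sketch that checks this for a given kernel release follows. It assumes each status line has the form `nvidia, 440.82, <kernel>, x86_64: installed` or `nvidia/440.82, <kernel>, x86_64: installed`, which varies between dkms versions; `nvidia_dkms_installed_for` is a hypothetical helper name.

```shell
# Reads `dkms status` output on stdin.
# Exits 0 if an nvidia module is reported as installed for the kernel
# release given as the first argument, 1 otherwise.
# nvidia_dkms_installed_for is a hypothetical helper name for illustration.
nvidia_dkms_installed_for() {
    grep '^nvidia' | grep -F -- ", $1," | grep -q 'installed'
}
```

Usage: `dkms status | nvidia_dkms_installed_for "$(uname -r)" || echo "nvidia module missing for this kernel"`.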

- Sanket Bodake · Answer 2

0

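chmod 700 gives the file's owner full read, write, and execute permissions and denies all access to other users.

First run the following command:

chmod 700 ./Nvidia.xyz.run
Then launch the NVIDIA driver installer:

sudo ./Nvidia.xyz.run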

- Sanket Bodake

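Thank you for your interest in contributing to the Stack Overflow community. This question already has quite a few answers, including one that has been extensively validated by the community. Are you certain your approach hasn't been given previously? If so, it would be useful to explain how your approach is different, under what circumstances it might be preferred, and why you think the previous answers aren't sufficient. Could you edit your answer to offer an explanation?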

- s510 · Answer 3

Ubuntu 22.04

sudo apt remove --purge 'nvidia*' && sudo ubuntu-drivers autoinstall && sudo reboot

- dbl001 · Answer 4

I needed to install the NVIDIA 367.57 driver and CUDA 7.5 for use with TensorFlow on a g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar. Now the GRID K520 GPU is working when I train TensorFlow models.

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

- Montoya · Answer 5

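None of the above helped me.

I am using Kubernetes on Google Cloud with a Tesla K80 GPU.

Follow this guide to make sure everything is installed correctly:

https://cloud.google.com/kubernetes-engine/docs/how-to/gpus

I had missed a few important things:

Install the NVIDIA GPU device drivers on your NODES, using the following commands:

COS nodes: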

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Ubuntu nodes:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

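Make sure the update has been applied to your nodes. If upgrades are turned off, restart the nodes.

I use the image nvidia/cuda:10.1-base-ubuntu16.04 in my Docker setup.
You must set a GPU limit! This is the only way for the node driver to communicate with the pod. In your YAML configuration, add the following under your container: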
```
resources:
  limits:
    nvidia.com/gpu: 1
```

- Anthony Dave · Answer 6

In the /usr/src directory:

ls

Find your NVIDIA driver version (e.g. nvidia-535.129.03), then:

sudo apt-get install dkms
sudo dkms install -m nvidia -v 535.129.03

If the version has srv in it (e.g. nvidia-srv-535.129.03), add srv before the version number: sudo dkms install -m nvidia -v srv-535.129.03. Problem solved.
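
The lookup in /usr/src can be scripted. The sketch below follows this answer's convention of passing everything after the leading `nvidia-` as the `-v` argument, including an optional `srv-` prefix; whether dkms accepts that split on a given system is not verified here. `nvidia_dkms_versions` is a hypothetical helper name.

```shell
# Reads directory names on stdin (for example from `ls /usr/src`) and prints,
# for each NVIDIA driver source directory, the version string that this
# answer passes to `dkms install -m nvidia -v ...`.
# nvidia_dkms_versions is a hypothetical helper name for illustration.
nvidia_dkms_versions() {
    sed -nE 's/^nvidia-((srv-)?[0-9][0-9.]*)$/\1/p'
}
```

Usage: `ls /usr/src | nvidia_dkms_versions`. Directories that are not NVIDIA driver sources, such as kernel headers, produce no output.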

- Tony Hill · Answer 7

-16

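Try unplugging the NVIDIA graphics card and plugging it back in.

- Tony Hill

6

How could this possibly help? - giusti

13

How would you do this on an AWS instance? - Caustic

This is not an SNES game cartridge. - Eduardo Reis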