How to check whether TensorFlow is using all available GPUs

6

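I am learning how to use TensorFlow for object detection. To speed up training, I am using an AWS g3.16xlarge instance, which has 4 GPUs. I start the training process with the following commands: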

export CUDA_VISIBLE_DEVICES=0,1,2,3
 python object_detection/train.py --logtostderr --pipeline_config_path=/home/ubuntu/builder/rcnn.config --train_dir=/home/ubuntu/builder/experiments/training/

In the rcnn.config file I set batch-size = 1. At runtime I get the following console output:
2018-11-09 07:25:50.104310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-11-09 07:25:50.104385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 
2018-11-09 07:25:50.104395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N N N 
2018-11-09 07:25:50.104402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y N N 
2018-11-09 07:25:50.104409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y N 
2018-11-09 07:25:50.104416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N N Y 
2018-11-09 07:25:50.104429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:1b.0, compute capability: 5.2)
2018-11-09 07:25:50.104439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla M60, pci bus id: 0000:00:1c.0, compute capability: 5.2)
2018-11-09 07:25:50.104446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2018-11-09 07:25:50.104455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)

When I run nvidia-smi, I get the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   52C    P0   129W / 150W |   7382MiB /  7612MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   33C    P0    38W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   40C    P0    38W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   34C    P0    39W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     97860    C   python                                        7378MiB |
|    1     97860    C   python                                        7233MiB |
|    2     97860    C   python                                        7233MiB |
|    3     97860    C   python                                        7233MiB |
+-----------------------------------------------------------------------------+

nvidia-smi dmon gives the following output:

# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0   158    69    90    69     0     0  2505  1177
    1    38    36     0     0     0     0  2505   556
    2    38    45     0     0     0     0  2505   556
    3    39    37     0     0     0     0  2505   556

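I am confused by each of these outputs. The program reports 4 different GPUs as available, yet in the nvidia-smi output only the first GPU shows a non-zero GPU-Util percentage; the other three are at 0%. The process table at the bottom of the same output, however, lists memory usage on all 4 GPUs. Likewise, nvidia-smi dmon shows a non-zero sm value only for the first GPU; the other three are at zero. From a blog post I read, I understand that a zero in dmon means the GPU is idle.
What I want to know is whether train.py is using all 4 GPUs on my instance. If it is not, how can I make sure that TensorFlow's object_detection/train.py runs on all of them?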
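
The reading of the dmon table described above can be sketched in Python. This is only an illustration under stated assumptions: `idle_gpus` and `SAMPLE` are hypothetical names, and the parser assumes the column layout shown in the dmon output above (a `#`-prefixed header row that names a `gpu` and an `sm` column).

```python
def idle_gpus(dmon_text, threshold=0):
    """Return the indices of GPUs whose 'sm' utilization is <= threshold."""
    header = None
    idle = []
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            if "sm" in fields:  # the header row that names the columns
                header = fields
            continue
        if header is None:
            continue
        row = dict(zip(header, line.split()))
        if not row["sm"].isdigit():  # skip rows without a numeric sm value
            continue
        if int(row["sm"]) <= threshold:
            idle.append(int(row["gpu"]))
    return idle


# The dmon sample from the question, copied verbatim.
SAMPLE = """\
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0   158    69    90    69     0     0  2505  1177
    1    38    36     0     0     0     0  2505   556
    2    38    45     0     0     0     0  2505   556
    3    39    37     0     0     0     0  2505   556
"""

print(idle_gpus(SAMPLE))  # [1, 2, 3]
```

On this sample, GPUs 1, 2 and 3 are reported as idle, which matches the 0% GPU-Util values in the nvidia-smi table.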

Possibly also relevant: https://unix.stackexchange.com/questions/16407/how-to-check-which-gpu-is-active-in-linux/240036#240036 - Ciro Santilli OurBigBook.com
2 Answers

4

Check whether it returns a list of all the GPUs.

tf.test.gpu_device_name()

This returns the name of a GPU device if one is available, or an empty string otherwise.
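
Because tf.test.gpu_device_name() reports a single device name, it does not by itself show how many GPUs TensorFlow can see. Below is a minimal sketch for listing all of them. It assumes that `device_lib.list_local_devices()` from `tensorflow.python.client` is available in the installed TensorFlow version; `gpu_names` and `list_local_gpus` are hypothetical helper names, not part of TensorFlow.

```python
def gpu_names(devices):
    """Return the names of all devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_local_gpus():
    """List the GPUs visible to TensorFlow (requires TensorFlow)."""
    # Imported lazily so that gpu_names() can be used without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_names(device_lib.list_local_devices())
```

On the 4-GPU instance from the question, `list_local_gpus()` would be expected to return four entries. Fewer than four would suggest that CUDA_VISIBLE_DEVICES or the driver setup is hiding devices, which is a different problem from TensorFlow seeing the GPUs but leaving them idle.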
You can then use all the available GPUs like this:
# TensorFlow 1.x API (tf.Session, tf.ConfigProto).
import tensorflow as tf

# Build one matmul per GPU.
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
# Sum the per-GPU results on the CPU.
with tf.device('/cpu:0'):
  total = tf.add_n(c)
# Create a session that logs which device each op is placed on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Run the op.
print(sess.run(total))

You will see output like the following:
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K20m, pci bus id: 0000:02:00.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla K20m, pci bus id: 0000:03:00.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla K20m, pci bus id: 0000:83:00.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla K20m, pci bus id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/device:GPU:3
Const_2: /job:localhost/replica:0/task:0/device:GPU:3
MatMul_1: /job:localhost/replica:0/task:0/device:GPU:3
Const_1: /job:localhost/replica:0/task:0/device:GPU:2
Const: /job:localhost/replica:0/task:0/device:GPU:2
MatMul: /job:localhost/replica:0/task:0/device:GPU:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[  44.   56.]
 [  98.  128.]]
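
The printed matrix can be checked by hand. The loop builds the same 2x3 by 3x2 product once per device, and tf.add_n sums the two copies, so the result is twice a single product. Here is a plain-Python sketch of that arithmetic; `matmul` and `add_n` are small stand-ins written for illustration, not the TensorFlow functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_n(matrices):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # shape [2, 3]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape [3, 2]

per_device = matmul(a, b)                 # what each GPU computes
total = add_n([per_device, per_device])   # two devices, two copies

print(per_device)  # [[22.0, 28.0], [49.0, 64.0]]
print(total)       # [[44.0, 56.0], [98.0, 128.0]]
```

A single device therefore yields [[22, 28], [49, 64]], and the two-device sum yields the [[44, 56], [98, 128]] shown in the output.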

The result should be: [[ 22. 28.] [ 49. 64.]] - user_dhrn

1

Python code to check whether a GPU is available to TensorFlow:

## Libraries import
import tensorflow as tf

## Test GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
print('')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
