What is "Device interconnect StreamExecutor with strength 1 edge matrix"?

Question

I have four NVIDIA GTX 1080 graphics cards, and when I initialize a session I see the following console output:

Adding visible gpu devices: 0, 1, 2, 3
 Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3 
 0:   N Y N N 
 1:   Y N N N 
 2:   N N N Y 
 3:   N N Y N

I also have 2 NVIDIA M60 Tesla graphics cards, and there the initialization looks like this:

Adding visible gpu devices: 0, 1, 2, 3
 Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3 
 0:   N N N N 
 1:   N N N N 
 2:   N N N N 
 3:   N N N N

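I also noticed that this output changed for the 1080 GPUs after the latest update from 1.6 to 1.8. It used to look something like this (I cannot remember precisely, this is from memory):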

 Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3            0 1 2 3
0:   Y N N N         0: N N Y N
1:   N Y N N    or   1: N N N Y
2:   N N Y N         2: Y N N N
3:   N N N Y         3: N Y N N

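My questions are:

What is this device interconnect?
What influence does it have on computation power?
Why does it differ between GPUs?
Can it change over time for hardware reasons (failures, driver inconsistency, and so on)?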

- Ivan Talalaev

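You can find more information here: https://developer.nvidia.com/gpudirect. Basically, if there is a Y in the matrix, the corresponding GPUs can share memory and pass it to each other without going back through the CPU, which can reduce the memory overhead of multi-device training. - JumbaMumba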

1 Answer

- McAngus · Accepted Answer

TL;DR

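What is this device interconnect?

As Almog David said in the comments, this tells you whether one GPU has direct memory access to the other.

What influence does it have on computation power?

It only affects multi-GPU training. Data transfer is faster if the two GPUs have a device interconnect.

Why does it differ between GPUs?

It depends on the topology of the hardware setup. A motherboard only has so many PCI-e slots connected by the same bus. (Use nvidia-smi topo -m to check the topology.)

Can it change over time for hardware reasons (failures, driver inconsistency, and so on)?

I don't think the order changes over time unless NVIDIA changes the default enumeration scheme. There is a little more detail here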

Explanation

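This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates over each pair of devices in the given order and calls cuDeviceCanAccessPeer. As Almog David mentioned in the comments, this just indicates whether you can perform DMA between the devices.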
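The pairwise loop described above can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's actual code: `edge_matrix`, `fake_can_access_peer` and `PEER_PAIRS` are hypothetical names, and the fake callback stands in for cuDeviceCanAccessPeer so the sketch runs without a GPU. The diagonal is hard-coded to N because that is what the logs from 1.8 show.

```python
def edge_matrix(num_devices, can_access_peer):
    """Return the matrix as text lines, laid out like the log message."""
    lines = ["     " + " ".join(str(j) for j in range(num_devices))]
    for i in range(num_devices):
        cells = []
        for j in range(num_devices):
            # Assumption: a device is not reported as its own peer,
            # matching the N diagonal in the logs above.
            ok = i != j and can_access_peer(i, j)
            cells.append("Y" if ok else "N")
        lines.append(f"{i}:   " + " ".join(cells))
    return lines


# Hypothetical stand-in for cuDeviceCanAccessPeer:
# GPUs 0 and 1 are peers, and GPUs 2 and 3 are peers.
PEER_PAIRS = {(0, 1), (1, 0), (2, 3), (3, 2)}


def fake_can_access_peer(i, j):
    return (i, j) in PEER_PAIRS


matrix_lines = edge_matrix(4, fake_can_access_peer)
print("\n".join(matrix_lines))
```

With this fake peer set, the printed rows have the same Y/N pattern as the GTX 1080 output in the question.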

You can run a small test to check that the order matters. Consider the following snippet:

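# test.py (TensorFlow 1.x API)
import tensorflow as tf

# Allow growth so the session takes up minimal GPU memory
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Creating the session triggers the device interconnect log message
sess = tf.Session(config=config)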

Now let's check the output with different device orders in CUDA_VISIBLE_DEVICES.

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3 
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y N N 
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N N N 
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   N N N Y 
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   N N Y N 
...

$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3 
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N N N Y 
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   N N Y N 
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   N Y N N 
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y N N N
...
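The two logs are consistent with the matrix simply being permuted by the order given in CUDA_VISIBLE_DEVICES. A minimal sketch of that relabeling, assuming the first log reflects the physical peer relationships; `reorder`, `render` and `PHYSICAL` are names made up for this illustration:

```python
def reorder(physical, visible_order):
    """Relabel a peer matrix so that logical device k is physical device
    visible_order[k], as with CUDA_VISIBLE_DEVICES."""
    return [[physical[a][b] for b in visible_order] for a in visible_order]


def render(matrix):
    """Format rows in the same Y/N layout as the log."""
    return [
        f"{i}:   " + " ".join("Y" if cell else "N" for cell in row)
        for i, row in enumerate(matrix)
    ]


# Peer relationships taken from the CUDA_VISIBLE_DEVICES=0,1,2,3 log:
# 0 <-> 1 and 2 <-> 3.
PHYSICAL = [
    [False, True, False, False],
    [True, False, False, False],
    [False, False, False, True],
    [False, False, True, False],
]

reordered_lines = render(reorder(PHYSICAL, [2, 0, 1, 3]))
print("\n".join(reordered_lines))
```

For the order 2,0,1,3 this gives the same rows as the second log, so the hardware links themselves did not change, only the device numbering did.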

You can get a more detailed description of the connections by running nvidia-smi topo -m. For example:

       GPU0      GPU1    GPU2   GPU3    CPU Affinity
GPU0     X       PHB    SYS     SYS     0-7,16-23
GPU1    PHB       X     SYS     SYS     0-7,16-23
GPU2    SYS      SYS     X      PHB     8-15,24-31
GPU3    SYS      SYS    PHB      X      8-15,24-31

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

I believe the further down this list a connection type appears, the faster the transfer.
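To compare the topology table with the edge matrix, the table can be parsed into per-pair connection types. The sketch below works on the sample table above pasted in as text. `parse_topo` is a hypothetical helper, and it assumes the GPU columns come first and are named GPU0, GPU1, and so on; real nvidia-smi output may include extra columns that it does not handle.

```python
SAMPLE = """\
       GPU0      GPU1    GPU2   GPU3    CPU Affinity
GPU0     X       PHB    SYS     SYS     0-7,16-23
GPU1    PHB       X     SYS     SYS     0-7,16-23
GPU2    SYS      SYS     X      PHB     8-15,24-31
GPU3    SYS      SYS    PHB      X      8-15,24-31
"""


def parse_topo(text):
    """Map (source GPU, destination GPU) to the connection type."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    links = {}
    for ln in lines[1:]:
        parts = ln.split()
        if not parts[0].startswith("GPU"):
            continue  # skip legend or other non-GPU rows
        src = parts[0]
        for dst, kind in zip(header, parts[1:1 + len(header)]):
            if src != dst:
                links[(src, dst)] = kind
    return links


links = parse_topo(SAMPLE)

# Group each unordered pair by connection type.
by_kind = {}
for (src, dst), kind in links.items():
    if src < dst:
        by_kind.setdefault(kind, []).append((src, dst))

for kind, pairs in by_kind.items():
    print(kind, pairs)
```

If this table comes from the same machine as the logs, the PHB pairs (GPU0/GPU1 and GPU2/GPU3) are exactly the pairs marked Y in the first edge matrix, and the SYS pairs are the ones marked N.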