Is there a way to determine how much GPU memory TensorFlow is using?

Question

gpu tensorflow

30

TensorFlow tends to preallocate all of the available memory on its GPUs. For debugging, is there a way to tell how much of that memory is actually in use?

- Maarten

6 Answers

9

Here is a practical solution that worked well for me:

Disable GPU memory pre-allocation using the TF session configuration:

import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU memory on demand instead of preallocating it
sess = tf.Session(config=config)

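Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.

Step through your code with the debugger until you see the unexpected GPU memory consumption.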
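While stepping through code, the monitoring step can also be done from Python rather than from a second terminal. The following is a minimal sketch, not part of the original answer: the helper names `parse_memory_csv` and `query_gpu_memory` are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports plain integers for both fields.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' pairs in MiB, one line per GPU, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; needs an NVIDIA driver installation."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` before and after a suspicious statement shows how much the process-level figure moved. This is the memory the driver has handed to processes, so with preallocation left on it will not reflect what TensorFlow actually uses.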

- eitanrich

This should also work in TF2; see https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth - bers
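For TF2, where `tf.ConfigProto` and `tf.Session` are no longer the default API, the same setting might look roughly like the sketch below. The wrapper name `enable_memory_growth` is hypothetical; the TensorFlow calls are the ones documented at the link above.

```python
def enable_memory_growth():
    """Turn on memory growth for every visible GPU and return the device names."""
    # Imported inside the function so this module can be loaded without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPU is first used; otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]
```

Call it once at program start, before any tensors or models are created.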

7

There is some code in tensorflow.contrib.memory_stats that will help with this:

import tensorflow as tf  # TF 1.x only; tf.contrib was removed in TF 2
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse

with tf.device('/device:GPU:0'):  # Replace with device you are interested in
  bytes_in_use = BytesInUse()
with tf.Session() as sess:
  print(sess.run(bytes_in_use))

- Steve

2

I think this no longer exists in TF2; see https://github.com/tensorflow/tensorflow/issues/40383 - bers

2

The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information. For details, see: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory

- Peter

1

The link is broken. - mrgloom

1

The link is broken. - user5473110

The link probably pointed to: https://www.tensorflow.org/guide/profiler#memory_profile_tool - Matěj Šmíd
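With the TF2 profiler described in that guide, capturing a trace for the memory profile tool could look something like the sketch below. The helper `profile_steps`, its parameters, and the `logs/profile` directory are hypothetical names chosen for illustration; the sketch assumes a TF2 release that provides `tf.profiler.experimental`.

```python
def profile_steps(step_fn, num_steps, logdir="logs/profile"):
    """Run step_fn num_steps times while the TensorFlow profiler is recording."""
    # Imported inside the function so this module can be loaded without TensorFlow.
    import tensorflow as tf

    tf.profiler.experimental.start(logdir)
    try:
        for _ in range(num_steps):
            step_fn()
    finally:
        # Always stop the profiler so the trace is written even if a step fails.
        tf.profiler.experimental.stop()
```

The resulting trace can then be opened in TensorBoard pointed at the same log directory. Viewing it typically requires the TensorBoard profiler plugin to be installed.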

0

tf.config.experimental.get_memory_info('GPU:0')

It currently returns the following keys:

'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
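Since both values are raw byte counts, a small wrapper makes them easier to read while debugging. This is an illustrative sketch: `format_memory_info` and `report_gpu_memory` are hypothetical helper names, and `'GPU:0'` assumes at least one visible GPU.

```python
def format_memory_info(info):
    """Render the dict returned by get_memory_info as a summary string in MiB."""
    mib = 1024 * 1024
    return "current: {:.1f} MiB, peak: {:.1f} MiB".format(
        info["current"] / mib, info["peak"] / mib)


def report_gpu_memory(device="GPU:0"):
    """Query TensorFlow's allocator statistics for one device and format them."""
    # Imported inside the function so the formatter works without TensorFlow.
    import tensorflow as tf

    return format_memory_info(tf.config.experimental.get_memory_info(device))
```

For example, `print(report_gpu_memory())` after building a model prints one line with both figures.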

- Vijay Mariappan

0

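As @V.M mentioned previously, a solution that works well is to use: tf.config.experimental.get_memory_info('DEVICE_NAME')

This function returns a dictionary with two keys:

'current': the memory currently used by the device, in bytes.
'peak': the peak memory used by the device across the run of the program, in bytes.

The values of these keys are the memory actually in use, not the allocated memory that nvidia-smi reports.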

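In fact, for GPUs, TensorFlow allocates all available memory by default, which makes nvidia-smi useless for checking how much memory your code uses. Even when tf.config.experimental.set_memory_growth is set to true, TensorFlow no longer allocates all of the available memory, but it still allocates more than is actually in use, and does so in discrete steps: for example 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, and so on.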

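A small note concerning get_memory_info(): it does not return correct values if it is called inside a tf.function() decorated function. The peak key should therefore be read after executing the tf.function() decorated function to determine the peak memory used.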
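One way to apply this is to reset the peak statistic, run the decorated function from ordinary eager code, and read the peak afterwards. The sketch below rests on assumptions: `peak_bytes_during` is a hypothetical helper, and it relies on `tf.config.experimental.reset_memory_stats`, which is only available in newer TF2 releases.

```python
def peak_bytes_during(fn, *args, device="GPU:0"):
    """Call fn(*args) and return (result, peak bytes on the device since the reset)."""
    # Imported inside the function so this module can be loaded without TensorFlow.
    import tensorflow as tf

    # Resetting sets the tracked peak to the current usage, so memory that is
    # already live before the call is still included in the reported peak.
    tf.config.experimental.reset_memory_stats(device)
    result = fn(*args)
    # Read the statistic outside the tf.function, in eager code.
    peak = tf.config.experimental.get_memory_info(device)["peak"]
    return result, peak
```

With a `@tf.function` decorated `train_step` and an input `batch` (both placeholder names), `_, peak = peak_bytes_during(train_step, batch)` gives the peak reached during that call.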

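For older versions of TensorFlow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function, and it returned only the memory in use (with no option for determining the peak memory).

Finally, you can also consider the TensorFlow Profiler available with TensorBoard, as @Peter mentioned.

Hope this helps :)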

- George El Haber


- Yao Zhang · Accepted Answer

(1) The timeline offers some limited support for logging memory allocations. Here is a usage example:

    import tensorflow as tf  # TF 1.x API
    from tensorflow.python.client import timeline

    # sess, merged, train_step, feed_dict, train_writer and i come from the MNIST example.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    summary, _ = sess.run([merged, train_step],
                          feed_dict=feed_dict(True),
                          options=run_options,
                          run_metadata=run_metadata)
    train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
    train_writer.add_summary(summary, i)
    print('Adding run metadata for', i)
    # Build the Chrome trace once, then print it and save it to a file named 'timeline'.
    tl = timeline.Timeline(run_metadata.step_stats)
    trace = tl.generate_chrome_trace_format(show_memory=True)
    print(trace)
    with tf.gfile.Open('timeline', 'w') as trace_file:
        trace_file.write(trace)

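You can try this code with the MNIST example (mnist with summaries).

This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It basically simulates a GPU execution, but does not have access to the full graph metadata. It also cannot know how many variables have been assigned to the GPU.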

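(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.

nvprof can show on-chip shared memory usage and register usage at the CUDA kernel level, but it does not show global/device memory usage.

Here is an example command: nvprof --print-gpu-trace matrixMul

More details here: http://docs.nvidia.com/cuda/profiler-users-guide/#abstract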