TensorFlow: Memory growth cannot differ between GPU devices | How to use TensorFlow for multi-GPU training


I am trying to run Keras code on a GPU node of a cluster. Each GPU node has 4 GPUs, and I made sure all 4 GPUs on the node are available for my use. I run the following code to let TensorFlow use the GPUs:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
        

When 4 GPUs are available, they are listed in the output. However, when I run the code I get the following error:
Traceback (most recent call last):
  File "/BayesOptimization.py", line 20, in <module>
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
    self.ensure_initialized()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
    config_str = self.config.SerializeToString()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
    gpu_options = self._compute_gpu_options()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
    raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices

Shouldn't the code list all available GPUs and set the memory growth flag for each of them?

I am currently using the TensorFlow library with Python 3.9.7:

tensorflow                2.4.1           gpu_py39h8236f22_0
tensorflow-base           2.4.1           gpu_py39h29c2da4_0
tensorflow-estimator      2.4.1              pyheb71bc4_0
tensorflow-gpu            2.4.1                h30adc30_0

Do you have any idea what the problem is and how to solve it? Thanks in advance!


Hi! Could you replace tf.config.set_memory_growth(gpu, True) with tf.config.experimental.set_memory_growth(gpu, True) and let us know how it goes? - user11530462
Hi! I already did that, but the same error still occurs. - Lana
Could you check the changes in this gist: https://colab.sandbox.google.com/gist/mohantym/ac018207e02ddb995818a74667161e0b/stack_71319195.ipynb. You can also use a slice such as [1:]/[2:] to use specific GPU cards. - user11530462
The program ran after I removed the following two lines: logical_gpus = tf.config.list_logical_devices('GPU') and print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs"). Not sure whether this solves the problem; the set_memory_growth code works fine, but the error is thrown as soon as the other line is called. - Lana
2 Answers


Try using os.environ["CUDA_VISIBLE_DEVICES"]="0" instead of tf.config.experimental.set_memory_growth. That worked for me.
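As a hedged sketch of this approach: CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the CUDA runtime, so the usual pattern is to assign it in os.environ before the tensorflow import. The GPU index "0" here is just an example; any comma-separated list of device indices works.

```python
import os

# Restrict CUDA to a single device *before* TensorFlow is imported;
# once the CUDA runtime has been initialized, changing this has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["CUDA_VISIBLE_DEVICES"])
```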


Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Community


With multiple GPU devices, memory growth must be the same across all available GPUs: either set it to true for all of them, or leave it false for all of them.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

TensorFlow GPU documentation
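For the multi-GPU training part of the question, a minimal sketch (assuming TensorFlow 2.x; the model and shapes are made up for illustration) is to set memory growth uniformly on every physical GPU first, then build the model inside a tf.distribute.MirroredStrategy scope so Keras replicates it across all visible GPUs:

```python
import tensorflow as tf

# Set memory growth uniformly on every GPU before any other TF call
# initializes the devices.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# MirroredStrategy replicates the model across all visible GPUs
# (it falls back to CPU when no GPU is present).
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(4,)),
    ])
    model.compile(optimizer='adam', loss='mse')

# model.fit(train_x, train_y, epochs=..., batch_size=...)
print(model.count_params())  # 4 weights + 1 bias
```

The important part is ordering: all set_memory_growth calls must happen before the first operation that initializes the GPUs, which is exactly the constraint the ValueError in the question enforces.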

