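I am trying to run Keras code on a GPU node in a cluster. Each GPU node has 4 GPUs, and I made sure that all 4 GPUs in the node are available to me. I run the following code so that TensorFlow uses the GPUs: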
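import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth has to be set the same way on every GPU
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Raised when memory growth is set after the GPUs have been initialized
        print(e)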
The 4 GPUs are available and are listed in the output. However, I get the following error when running the code:
Traceback (most recent call last):
  File "/BayesOptimization.py", line 20, in <module>
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
    self.ensure_initialized()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
    config_str = self.config.SerializeToString()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
    gpu_options = self._compute_gpu_options()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
    raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices
Shouldn't the code list all the available GPUs and set the memory growth flag for each of them?
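A minimal diagnostic sketch for checking this, under stated assumptions: `growth_flags_consistent` and `enable_growth_on_all_gpus` are hypothetical helper names, and the consistency check only mirrors what the error message implies (the setting must not differ between GPUs) rather than reproducing TensorFlow's internal code. The second helper sets memory growth on every physical GPU first and reads the flags back afterwards with `tf.config.experimental.get_memory_growth`.

```python
def growth_flags_consistent(flags):
    """Return True when every GPU carries the same memory-growth setting.

    Hypothetical helper: it mirrors what the error message implies
    (settings must not differ between GPUs), not TensorFlow's own code.
    """
    return len(set(flags)) <= 1


def enable_growth_on_all_gpus():
    """Set memory growth on every physical GPU, then report the flags."""
    # Imported lazily so the helper above can be tried without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    for gpu in gpus:
        # Needs to run before anything initializes the TensorFlow runtime.
        tf.config.experimental.set_memory_growth(gpu, True)
    # Read the flags back only after the loop has covered every GPU.
    flags = [tf.config.experimental.get_memory_growth(gpu) for gpu in gpus]
    return gpus, flags


# Illustration without TensorFlow: growth set on one of four GPUs vs. on all four.
print(growth_flags_consistent([True, None, None, None]))  # False
print(growth_flags_consistent([True, True, True, True]))  # True
```

On the GPU node, calling `enable_growth_on_all_gpus()` before any other TensorFlow call and passing the returned flags to `growth_flags_consistent` would show whether all 4 GPUs really end up with the same setting before the logical devices are queried.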
I am currently using the following TensorFlow packages with Python 3.9.7:
tensorflow            2.4.1   gpu_py39h8236f22_0
tensorflow-base       2.4.1   gpu_py39h29c2da4_0
tensorflow-estimator  2.4.1   pyheb71bc4_0
tensorflow-gpu        2.4.1   h30adc30_0
Do you have any idea what the problem is and how to solve it? Thanks in advance!