failed to alloc X bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory

4

I am trying to run a TensorFlow project and I am hitting memory issues on the university HPC cluster. I need to run a prediction task on hundreds of inputs of varying length. We have GPU nodes with different amounts of VRAM, so I am trying to set up the script in a way that does not crash on any combination of GPU node and input length.

While searching for a solution I tried TF_FORCE_UNIFIED_MEMORY, XLA_PYTHON_CLIENT_MEM_FRACTION, XLA_PYTHON_CLIENT_PREALLOCATE and TF_FORCE_GPU_ALLOW_GROWTH, as well as TensorFlow's set_memory_growth. As far as I understand, with unified memory I should be able to use more memory than the GPU itself has.

This is my final solution (only the relevant parts):

import os

os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # as I understood, this is redundant with the set_memory_growth part :)

import tensorflow as tf    
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      print(gpu)
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

I use the Slurm job scheduler and submit the code to the cluster with --mem=30G --gres=gpu:1.

The code crashes with the error below. As I understand it, it tries to use unified memory but fails for some reason.

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.


Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
    donated_invars=donated_invars, inline=inline)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
    *unsafe_map(arg_spec, args))
  File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, *args)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
    compiled = compile_or_get_cached(backend, built, options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
    return backend_compile(backend, computation, compile_options)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/script.py", line 654, in <module>
    prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
  File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
    result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.

I would be happy to hear any ideas on how to make this work and use all of the available memory.

Thank you!


OOM is not a programming error. I think that before you start training you should work out how much VRAM a given batch size will consume and adjust accordingly. You can also try gradient accumulation and mixed-precision training. - Innat
Thanks, M.Innat. I am not training; I am using the model for prediction, but there is an option to turn on a training feature (dropout), and that is when the OOM error appears. - aqua
2 Answers

2
It looks like your GPU does not fully support unified memory. The support is limited: in practice the GPU keeps all of the data in its own memory.
See the description in this article: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/ In particular:
On systems with earlier GPUs such as the Tesla K80, calling cudaMallocManaged() allocates the requested number of bytes of managed memory at the time of the call and places it on the GPU device. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows those pages are resident on that GPU.
And:
Because these older GPUs cannot page fault, all data must be resident on the GPU in case a kernel accesses it (even if it never does).
According to TechPowerUp, your GPU is Kepler-based: https://www.techpowerup.com/gpu-specs/geforce-gtx-titan-black.c2549 As far as I know, TensorFlow should also emit a warning about this, something like:
Unified memory on GPUs with compute capability lower than 6.0 (pre-Pascal class GPUs) does not support oversubscription.
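
If it helps, here is a minimal sketch of how you could check this from inside the script (assuming TensorFlow 2.4+, where tf.config.experimental.get_device_details is available; the compute_capability key may be missing on some builds):

import tensorflow as tf

# Minimal sketch: detect pre-Pascal GPUs (compute capability < 6.0), which
# cannot oversubscribe unified memory. Assumes TF >= 2.4.
for gpu in tf.config.list_physical_devices('GPU'):
    details = tf.config.experimental.get_device_details(gpu)
    cc = details.get('compute_capability')   # e.g. (3, 5) for Kepler, (6, 0)+ for Pascal
    name = details.get('device_name', gpu.name)
    if cc is not None and cc < (6, 0):
        print(name, cc, '-> unified memory cannot be oversubscribed on this GPU')
    else:
        print(name, cc)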

-1

Maybe this answer is useful for you. The nvidia_smi Python module has some handy utilities, such as checking the GPU's total memory. Here I reproduce the code from the answer I mentioned.

import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)

print("Total memory:", info.total)

nvidia_smi.nvmlShutdown()

I think this should be your starting point. A simple solution is to set the batch size according to the GPU memory. If you only want prediction results, there is usually nothing besides the batch size that needs much memory. Also, if any preprocessing happens on the GPU, move it to the CPU. A rough sketch of this idea follows.
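
For example, something like this could derive the batch size from the reported total memory (the per-sample byte count is a placeholder; you would have to measure it for your own model):

import nvidia_smi

# Rough sketch: pick a batch size from the GPU's total memory.
# BYTES_PER_SAMPLE is a hypothetical, model-specific value that you would
# measure yourself (e.g. by profiling a single sample).
BYTES_PER_SAMPLE = 512 * 1024 ** 2   # placeholder: 512 MiB per sample
RESERVED = 2 * 1024 ** 3             # keep ~2 GiB free for weights/workspace

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
nvidia_smi.nvmlShutdown()

batch_size = max(1, (info.total - RESERVED) // BYTES_PER_SAMPLE)
print("Total memory:", info.total, "-> batch size:", batch_size)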


Thanks for your answer! Unfortunately, I have already moved everything I can to the CPU. Also, in the future I may need to run this on larger inputs, so I do not want to tailor it to my current setup. As far as I understand, unified memory is supposed to cover exactly these cases, but for some reason it fails to allocate. I could not find any solution on the internet for this specific "failed to alloc 12758286336 bytes unified memory;" error. - aqua
