TensorFlow GPU内存不足:分配器(GPU_0_bfc)在尝试分配内存时用完了内存。

5
我对Tensorflow比较新手,在使用Datasets时遇到了一些问题。我在Windows 10上工作,使用的是Tensorflow 2.6.0和CUDA。
我有两个numpy数组,它们是X_train和X_test(已经拆分)。训练集大小为5Gb,测试集大小为1.5Gb。它们的形状分别为:
X_train: (259018, 30, 30, 3), Y_train: (259018, 1), 我使用以下代码创建Datasets:
dataset_train = tf.data.Dataset.from_tensor_slices((X_train , Y_train)).batch(BATCH_SIZE)

批量大小(BATCH_SIZE)为32。

但我无法创建数据集,会出现以下错误:

2021-09-02 15:26:35.429930: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-02 15:26:35.772235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3495 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2021-09-02 15:26:36.414627: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2700000000 exceeds 10% of free system memory.
2021-09-02 15:26:47.146977: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 607.1KiB (rounded to 621824)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2021-09-02 15:26:47.147299: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2021-09-02 15:26:47.147383: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147514: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147636: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024):     Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2021-09-02 15:26:47.147761: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147905: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148040: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148157: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148276: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148402: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148518: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148645: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148786: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148918: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576):  Total Chunks: 1, Chunks in use: 1. 1.91MiB allocated for chunks. 1.91MiB in use in bin. 1.91MiB client-requested in use in bin.
2021-09-02 15:26:47.149079: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.149212: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.149342: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

2021-09-02 15:26:47.149477: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

2021-09-02 15:26:47.164471: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164619: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164765: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164884: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456):    Total Chunks: 2, Chunks in use: 2. 3.41GiB allocated for chunks. 3.41GiB in use in bin. 3.30GiB client-requested in use in bin.
2021-09-02 15:26:47.164982: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 607.2KiB was 512.0KiB, Chunk State: 
2021-09-02 15:26:47.165040: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 3665166336
2021-09-02 15:26:47.165106: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at b0e200000 of size 2700000000 next 1
2021-09-02 15:26:47.165159: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf0ebb00 of size 1280 next 2
2021-09-02 15:26:47.165208: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf0ec000 of size 2000128 next 3
2021-09-02 15:26:47.165250: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf2d4500 of size 963164928 next 18446744073709551615
2021-09-02 15:26:47.165297: I tensorflow/core/common_runtime/bfc_allocator.cc:1065]      Summary of in-use Chunks by size: 
2021-09-02 15:26:47.165341: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1280 totalling 1.2KiB
2021-09-02 15:26:47.165382: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 2000128 totalling 1.91MiB
2021-09-02 15:26:47.165426: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 963164928 totalling 918.54MiB
2021-09-02 15:26:47.165470: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 2700000000 totalling 2.51GiB
2021-09-02 15:26:47.165514: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 3.41GiB
2021-09-02 15:26:47.165558: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 3665166336 memory_limit_: 3665166336 available bytes: 0 curr_region_allocation_bytes_: 7330332672
2021-09-02 15:26:47.165633: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats: 
Limit:                      3665166336
InUse:                      3665166336
MaxInUse:                   3665166336
NumAllocs:                           4
MaxAllocSize:               2700000000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-09-02 15:26:47.165771: W tensorflow/core/common_runtime/bfc_allocator.cc:468] *************************************************************************************************xxx
Traceback (most recent call last):
  File "C:/Users/headl/Documents/github projects/datascience/DL_model_deep_insight.py", line 100, in <module>
    dataset_train, dataset_test = prepare_tf_dataset(path_to_x_train, config.y_train_combined,
  File "C:/Users/headl/Documents/github projects/datascience/DL_model_deep_insight.py", line 28, in prepare_tf_dataset
    dataset_test = tf.data.Dataset.from_tensor_slices((X_test , Y_test)).batch(BATCH_SIZE)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 685, in from_tensor_slices
    return TensorSliceDataset(tensors)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 3844, in __init__
    element = structure.normalize_element(element)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\util\structure.py", line 129, in normalize_element
    ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\profiler\trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 308, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

Process finished with exit code 1

似乎存在GPU内存耗尽的问题,实际上,在我按照此过程在Windows任务管理器中进行跟踪时,可以看到在脚本死亡之前GPU使用率达到峰值。我尝试只使用X_train的一部分,可以创建一个数据集,最多只能使用X_train[:240000]。当我在此之后添加更多行时,错误就会出现。
我认为Tensorflow Dataset是一个生成器,应该可以处理内存问题和批次?此外,减少批次大小也没有任何效果。
我还尝试了建议的“TF_GPU_ALLOCATOR=cuda_malloc_async”,但它也没有起作用。
我该怎么做来加载整个数据?
非常感谢您的帮助!
2个回答

6

这是按设计工作的。from_tensor_slices 只适用于小数据量,Dataset 旨在处理需要从磁盘流式传输的大型数据集。

实现此目标的较困难但理想的方法是将 numpy 数组数据写入 TFRecords,然后通过 TFRecordDataset 将其作为数据集读取。这是指南。

https://www.tensorflow.org/tutorials/load_data/tfrecord

较为简单但性能较低的方法是使用Dataset.from_generator。以下是一个最小化示例:

>>> ds = tf.data.Dataset.from_generator(lambda: np.arange(100), output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))
>>> for d in ds:
...   print(d)
... 
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
...

谢谢Yaoshiang,看起来对我来说最好的选择是创建一个TFRecordDataset! - anastasia-enot
这将是理想的,而且不难…肯定不会超过一天。只需要学习一些新的API,但你将在你的TF职业生涯中继续使用这些知识,所以值得一试。 - Yaoshiang

2

确保您的批处理大小为32。批处理大小越大越好,但有时会引起此问题。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接