Loading large data into TensorFlow 2.0 without loading it into RAM

Question

python numpy tensorflow tensorflow-datasets

8

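I have processed and saved a large dataset of video and audio files (about 8 to 9 GB of data). The data is saved as two numpy arrays, one for each modality. Each file has the shape (number of examples, maximum time length, feature length).

I want to use this data to train my neural network for a classification task. I am using TensorFlow 2.0 Beta and running all the code on Google Colab (after installing tf-2.0 beta). Every time I load the data into tf.data, the virtual machine's entire RAM is used up and the session is forced to restart.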

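Previous approaches:

I tried two approaches:

1) Loading both variables completely into RAM and converting them to tensors

2) Loading the data as memory-mapped arrays (from disk) and passing them to tf.data

However, both approaches fill up the RAM and force the VM to restart.

Code: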

import pickle as pkl

import numpy as np
import tensorflow as tf

# Access the audio data on disk without loading it into RAM
X_audio = np.memmap('gdrive/My Drive/Codes/audio_data.npy', dtype='float32', mode='r').reshape(2198,3860,74)

# Access the video data on disk without loading it into RAM
X_video = np.memmap('gdrive/My Drive/Codes/video_data.npy', dtype='float32', mode='r').reshape(2198,1158,711)

# Load labels
with open('gdrive/My Drive/Codes/label_data_3','rb') as f:
    Y = pkl.load(f)

dataset = tf.data.Dataset.from_tensor_slices((X_audio, X_video, Y)).shuffle(2198).batch(32)

Error: Your session crashed after using all available RAM.

- Anirudh B H

I'm not an expert, but this sounds like a job for dask. - Daniel F

2 Answers

1

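You should use the HDF5 file format, which is a good way to store multidimensional arrays on disk. Specifically, I recommend the h5py package, which provides a seamless interface for working with HDF5 files in Python.

I haven't used TensorFlow 2 yet, but in TF1 we could create TensorFlow dataset objects from a Python generator. Below is a generator that opens an HDF5 file and yields the elements of an array (along the first axis) one at a time, in random order.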

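import random

import h5py


def iterate_dataset(dataset_file, dataset_name):
    # Open the file read-only; samples are read from disk one at a time
    with h5py.File(dataset_file, 'r') as h5:
        # random.shuffle needs a mutable sequence, so build a list of indices
        idxs = list(range(len(h5[dataset_name])))
        random.shuffle(idxs)

        for i in idxs:
            yield h5[dataset_name][i]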
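
In TensorFlow 2, a generator like this can be passed to `tf.data.Dataset.from_generator`. The following is only a minimal sketch under stated assumptions, not the answerer's code: the file name `demo.hdf5`, the dataset name `data1`, the array shape, and the helper `hdf5_samples` are all hypothetical and exist only to make the example self-contained.

```python
import random

import h5py
import numpy as np
import tensorflow as tf

# Hypothetical demo file so the sketch is self-contained.
with h5py.File("demo.hdf5", "w") as h5:
    h5.create_dataset("data1", data=np.random.rand(10, 4, 3).astype("float32"))


def hdf5_samples(dataset_file, dataset_name):
    """Yield samples one at a time, in random order, reading each from disk."""
    with h5py.File(dataset_file, "r") as h5:
        idxs = list(range(len(h5[dataset_name])))
        random.shuffle(idxs)
        for i in idxs:
            yield h5[dataset_name][i]


# from_generator calls the lambda each time the dataset is iterated,
# so every epoch reopens the file and draws a fresh random order.
dataset = tf.data.Dataset.from_generator(
    lambda: hdf5_samples("demo.hdf5", "data1"),
    output_types=tf.float32,
    output_shapes=(4, 3),
).batch(5)

batch_shapes = [tuple(batch.shape) for batch in dataset]
```

Only the samples of the current batch are held in memory at any time; the rest stay in the HDF5 file.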

Here is some code for saving arrays to an HDF5 file as well.

import h5py

def save_array(arr, dataset_file, dataset_name, compress=True):
    # Open in append mode so several arrays can share one file
    with h5py.File(dataset_file, 'a') as h5:
        if compress:
            # One chunk per sample, so a single sample can be read on its own
            h5.create_dataset(
                dataset_name,
                data=arr,
                chunks=(1, *arr.shape[1:]),
                compression='lzf'
            )
        else:
            h5[dataset_name] = arr

save_array(data1, 'filename.hdf5', 'data1')
save_array(data2, 'filename.hdf5', 'data2')
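
The `chunks=(1, *arr.shape[1:])` argument stores each sample as its own compressed chunk, so indexing a single sample only has to read and decompress that one chunk. The sketch below illustrates this with a small stand-in array; the file name `chunk_layout_demo.hdf5` and the array shape are hypothetical.

```python
import h5py
import numpy as np

# Hypothetical demo data standing in for the real array.
arr = np.random.rand(6, 5, 2).astype("float32")

with h5py.File("chunk_layout_demo.hdf5", "w") as h5:
    h5.create_dataset(
        "data1",
        data=arr,
        chunks=(1, *arr.shape[1:]),  # one chunk per sample
        compression="lzf",
    )

with h5py.File("chunk_layout_demo.hdf5", "r") as h5:
    dset = h5["data1"]
    chunk_shape = dset.chunks           # chunk layout stored in the file
    sample = dset[3]                    # reads a single sample from disk
    matches = bool(np.array_equal(sample, arr[3]))  # LZF is lossless
```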

Finally, there may be some mistakes in the code, so I will read it over again once I am at my computer.

- Yngve Moe

- Praveen Kulkarni · Accepted Answer

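With the TensorFlow 2.x dataset API, you can use tf.data.Dataset.from_generator to create a dataset from a generator function. The generator function performs the reads through a numpy memmap.

The code below creates a dummy data file and then reads examples one at a time from the file on disk. It can easily be updated to read several examples in sequence to increase IO throughput (let me know if you need that implemented in the code example below).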

# imports
import numpy as np
import pathlib
import tensorflow as tf

# create huge numpy array and save it to disk
file = pathlib.Path("huge_data.npy")
examples = 5000
example_shape = (256, 256)
huge_data_shape = (examples, *example_shape)
huge_data_dtype = np.float64

# create the file if it does not exist
if not file.is_file():
    print("creating file with random data and saving to disk")
    numpy_data = np.random.rand(*huge_data_shape).astype(huge_data_dtype)
    np.save(file, numpy_data)

# memmap the file
numpy_data_memmap = np.load(file, mmap_mode='r')


# generator function
def data_generator():
    return iter(numpy_data_memmap)


# create tf dataset from generator fn
dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_types=huge_data_dtype,
    output_shapes=example_shape,
)

# consume huge dataset
for i, ex in enumerate(dataset):
    print(i, ex.shape, ex.dtype)

Output:

0 (256, 256) <dtype: 'float64'>
1 (256, 256) <dtype: 'float64'>
2 (256, 256) <dtype: 'float64'>
3 (256, 256) <dtype: 'float64'>
...
4995 (256, 256) <dtype: 'float64'>
4996 (256, 256) <dtype: 'float64'>
4997 (256, 256) <dtype: 'float64'>
4998 (256, 256) <dtype: 'float64'>
4999 (256, 256) <dtype: 'float64'>
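
The answer mentions that several examples could be read in sequence to improve IO throughput. One possible way to do that is sketched below. This is not the answerer's code: the file name `chunk_demo.npy`, the helper `chunk_generator`, and the chunk and batch sizes are hypothetical, and the small array stands in for the large file.

```python
import numpy as np
import tensorflow as tf

# Hypothetical small file standing in for the large array on disk.
np.save("chunk_demo.npy", np.arange(10 * 4 * 4, dtype=np.float64).reshape(10, 4, 4))
data = np.load("chunk_demo.npy", mmap_mode="r")


def chunk_generator(chunk_size=4):
    """Yield contiguous blocks of examples; only one block is in RAM at a time."""
    for start in range(0, len(data), chunk_size):
        # np.array copies just this slice out of the memory-mapped file.
        yield np.array(data[start:start + chunk_size])


dataset = tf.data.Dataset.from_generator(
    chunk_generator,
    output_types=tf.float64,
    output_shapes=(None, 4, 4),  # the last chunk may be shorter
)
# Split the chunks back into single examples, then batch for training.
dataset = dataset.unbatch().batch(3)

batch_sizes = [int(batch.shape[0]) for batch in dataset]
```

Reading a contiguous block per generator call reduces the number of Python-level calls and disk seeks, while `unbatch()` keeps the downstream batch size independent of the chunk size.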