Read data from a large file with h5py without loading the whole file into memory

Question

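Will the following code read data from a dataset with h5py in Python without loading the whole dataset into memory at once (the whole dataset does not fit in memory), and also get the size of the dataset? If not, how should it be done?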

import h5py

h5 = h5py.File('myfile.h5', 'r')
mydata = h5.get('matirx')  # are all data loaded into memory by using h5.get?
part_of_mydata = mydata[1000:11000, :]
size_data = mydata.shape

Thanks.

- superMind

1 Answer

- hpaulj · Accepted Answer

get (or indexing) fetches a reference to the dataset on the file, but does not load any data.

In [789]: list(f.keys())
Out[789]: ['dset', 'dset1', 'vset']
In [790]: d=f['dset1']
In [791]: d
Out[791]: <HDF5 dataset "dset1": shape (2, 3, 10), type "<f8">
In [792]: d.shape         # shape of dataset
Out[792]: (2, 3, 10)
In [793]: arr=d[:,:,:5]    # indexing the set fetches part of the data
In [794]: arr.shape
Out[794]: (2, 3, 5)
In [795]: type(d)
Out[795]: h5py._hl.dataset.Dataset
In [796]: type(arr)
Out[796]: numpy.ndarray

d, the dataset, is array-like, but it is not a numpy array.
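To make the difference concrete, here is a minimal sketch of reading the size from the dataset handle without reading its contents. The temporary file is made up for the demo; only the dataset name dset1 and its shape (2, 3, 10) mirror the session above. The byte count is an estimate of the in-memory size computed from shape and dtype, not a statement about the size on disk.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file; only the name "dset1" and its shape follow the session above.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("dset1", data=np.zeros((2, 3, 10), dtype="f8"))

with h5py.File(path, "r") as f:
    d = f["dset1"]                             # Dataset handle, no array data read yet
    shape = d.shape                            # taken from the dataset's metadata
    n_elements = d.size                        # number of elements
    est_bytes = d.size * d.dtype.itemsize      # estimated size as a numpy array
    is_ndarray = isinstance(d, np.ndarray)     # the handle is not a numpy array
    arr = d[0, :, :5]                          # indexing returns a numpy array

print(shape, n_elements, est_bytes, is_ndarray, arr.shape)
```

Checking est_bytes before slicing is one way to decide whether a slice is small enough to fit in memory.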

To fetch the whole dataset, use:

In [798]: arr = d[:]
In [799]: type(arr)
Out[799]: numpy.ndarray
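When the whole dataset does not fit in memory, as in the question, d[:] is not an option, and one common pattern is to walk the dataset in row blocks. The following is only a sketch under assumptions: the file, the dataset name "matrix", and the block size are invented for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file and dataset name.
path = os.path.join(tempfile.mkdtemp(), "blocks.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("matrix", data=np.arange(40, dtype="f8").reshape(10, 4))

block_rows = 3   # rows held in memory at a time (arbitrary demo value)
total = 0.0
with h5py.File(path, "r") as f:
    dset = f["matrix"]              # handle only
    n_rows = dset.shape[0]          # size is known without reading the data
    for start in range(0, n_rows, block_rows):
        # Each slice is a numpy array with at most block_rows rows.
        block = dset[start:start + block_rows, :]
        total += block.sum()

print(n_rows, total)
```

Only one block is held as a numpy array at any time, so peak memory is governed by block_rows rather than by the dataset size.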

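What part of the file has to be read to fetch your slice depends on the slicing, the data layout, chunking, and other things that are generally not under your control and shouldn't worry you.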
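If you are curious about the layout anyway, the chunks attribute of a dataset reports it. The sketch below uses made-up dataset names and chunk shape; it shows only how to inspect the layout, not how HDF5 schedules its reads.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file with one contiguous and one chunked dataset.
path = os.path.join(tempfile.mkdtemp(), "layout.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("contiguous", data=np.zeros((100, 20)))
    f.create_dataset("chunked", data=np.zeros((100, 20)), chunks=(10, 20))

with h5py.File(path, "r") as f:
    contiguous_layout = f["contiguous"].chunks   # None for a contiguous dataset
    chunked_layout = f["chunked"].chunks         # the chunk shape
    rows = f["chunked"][5:15, :]                 # this slice overlaps two chunks

print(contiguous_layout, chunked_layout, rows.shape)
```

The slicing syntax is the same for both layouts; the layout only affects how much of the file is touched behind the scenes.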

Note also that when I read one dataset, I'm not loading the others. The same applies to groups.
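A small sketch of that point, with a made-up group "run1" holding two datasets "a" and "b": listing the group and reading "a" leaves "b" as an unread handle.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical demo file: one group containing two datasets.
path = os.path.join(tempfile.mkdtemp(), "groups.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("run1")
    grp.create_dataset("a", data=np.arange(6).reshape(2, 3))
    grp.create_dataset("b", data=np.ones((4, 4)))

with h5py.File(path, "r") as f:
    grp = f["run1"]                              # Group handle, no dataset contents read
    is_group = isinstance(grp, h5py.Group)
    names = sorted(grp.keys())                   # lists members without reading them
    a = grp["a"][:]                              # only "a" is read into memory
    b_handle = grp["b"]                          # "b" stays a handle
    b_is_array = isinstance(b_handle, np.ndarray)

print(is_group, names, a.shape, b_is_array)
```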

http://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data