How can I read a dataset of given rows from an hdf5 file without loading the entire file? I have some very large hdf5 files containing many datasets. Here is an example where I would like to reduce the time and memory use:
#! /usr/bin/env python
import numpy as np
import h5py
infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']
mdisk = group['mdisk'].value
val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]
m = group['mcold'][ind]
print m
The index array ind does not give contiguous rows; the selected rows are scattered throughout the dataset.
The code above fails, even though it follows the standard way of slicing an hdf5 dataset. The error message I get is:
Traceback (most recent call last):
  File "./read_rows.py", line 17, in <module>
    m = group['mcold'][ind]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
    sel[arg]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays
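
For reference, h5py's fancy indexing accepts a boolean mask array or a plain Python list of indices in increasing order; the traceback suggests that a numpy integer array is instead treated as a point selection, which is what triggers the error here. Below is a minimal sketch of two possible workarounds, reusing the names from the snippet above (not tested against this exact h5py 2.3.1 build):

# Sketch of possible workarounds; assumes mcold and mdisk are 1-D datasets
# of the same length, as in the snippet above.
import numpy as np
import h5py

f = h5py.File('field1.87.hdf5', 'r')
group = f['Data']
mdisk = group['mdisk'][...]          # mdisk is still read in full here
mask = mdisk > 2.*pow(10., 10.)      # boolean mask over the rows

# Option 1: index the dataset with the boolean mask directly, so only the
# selected rows of mcold are read from disk.
m = group['mcold'][mask]

# Option 2: convert the integer indices to a plain Python list; np.where
# already returns them in increasing order, as h5py requires.
ind = np.where(mask)[0]
m = group['mcold'][ind.tolist()]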
The whole mdisk array is loaded into memory. I'd have to check the documentation to work out how much of mcold gets loaded; it probably depends on whether ind is a compact slice or values scattered across the whole array. - hpaulj
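
On that memory point: if mdisk itself is too large to hold at once, one possibility (only a sketch; the block size of 100000 rows is an arbitrary choice) is to scan both datasets in contiguous blocks, so that only one block is resident at a time:

# Sketch: process the file in contiguous blocks so that neither mdisk nor
# mcold is ever loaded in full; 100000 rows per block is arbitrary.
import numpy as np
import h5py

f = h5py.File('field1.87.hdf5', 'r')
group = f['Data']
mdisk_ds = group['mdisk']
mcold_ds = group['mcold']

val = 2.*pow(10., 10.)
step = 100000
pieces = []
for start in range(0, mdisk_ds.shape[0], step):
    block_mask = mdisk_ds[start:start+step] > val   # contiguous read of mdisk
    if block_mask.any():
        # read the matching block of mcold, then filter it in memory
        pieces.append(mcold_ds[start:start+step][block_mask])
m = np.concatenate(pieces) if pieces else np.empty(0)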