HDF5 dataset number limit

Question

7

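When using h5py to create an HDF5 file containing many datasets, I hit a massive slowdown after about 2.88 million datasets. What causes this?

My guess is that the tree structure holding the datasets reaches a limit, so the tree has to be reordered, which is very time-consuming.

Here is a short example: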

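import h5py
import time

# "w" creates the file explicitly; h5py 3.x opens files read-only by default.
hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")

barrier = 1
# time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
start = time.perf_counter()
for i in range(int(1e8)):
    hdf5_file.create_dataset(str(i), [])
    td = time.perf_counter() - start
    if td > barrier:
        # Print the elapsed seconds and the index of the latest dataset.
        print("{}: {}".format(int(td), i))
        barrier = int(td) + 1

    if td > 600:  # cancel after 600 s
        break

hdf5_file.close()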

Edit:

Grouping the datasets avoids this limit:

import h5py
import time

max_n_keys = int(1e7)
max_n_group = int(1e5)

hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")
group_key = str(max_n_group)
hdf5_file.create_group(group_key)

barrier = 1
# time.clock() was removed in Python 3.8; time.perf_counter() replaces it.
start = time.perf_counter()
for i in range(max_n_keys):

    # Open a new group once the index passes the current group's upper bound.
    if i > max_n_group:
        max_n_group += int(1e5)
        group_key = str(max_n_group)
        hdf5_file.create_group(group_key)

    hdf5_file[group_key].create_dataset(str(i), data=[])
    td = time.perf_counter() - start
    if td > barrier:
        print("{}: {}".format(int(td), i))
        barrier = int(td) + 1

hdf5_file.close()
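
The bucketing idea can also be expressed as a small pure function, which makes it easy to find a dataset's group again when reading. This is only a sketch: `group_key_for` and `dataset_path_for` are hypothetical helper names, and groups are named by bucket number (`"0"`, `"1"`, ...) rather than by upper bound as in the loop above.

```python
def group_key_for(index, group_size=100_000):
    """Return the name of the group that dataset `index` falls into.

    Assumption: groups are named by bucket number, so with the default
    group_size, indices 0..99999 map to "0", 100000..199999 to "1", etc.
    """
    return str(index // group_size)


def dataset_path_for(index, group_size=100_000):
    """Return the full in-file path "<group>/<dataset>" for dataset `index`."""
    return "{}/{}".format(group_key_for(index, group_size), index)
```

With h5py, a path such as `dataset_path_for(i)` could be passed directly to `create_dataset`, since intermediate groups in a path are created automatically.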

- setzberg

1

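Since you have apparently plotted the processing-time curve, maybe you could add it to the question. Also, what is the use case for a single file with millions of datasets? Are you sure you don't want a single dataset with millions of rows? - Djizeus

1 Answer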

- Joël Conraud · Accepted Answer

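Following the HDF Group documentation on metadata caching, I was able to push back the limit at which performance drops drastically. Basically, I called H5Fset_mdc_config() (in C/C++; I don't know how to access the equivalent HDF5 function from Python) and changed the max_size value in the config parameter to 128*1024*124.

Doing so, I was able to create 4 times more datasets.

Hope this helps.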
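
For Python users, h5py's low-level API appears to expose the same setting through `get_mdc_config()` and `set_mdc_config()` on the file's `id` object. The following is a minimal sketch, not the method used above: the helper name `raise_metadata_cache_limit` is hypothetical, and the 64 MiB value in the usage comment is an assumed illustration, not the value from this answer.

```python
import h5py


def raise_metadata_cache_limit(h5_file, max_size_bytes):
    """Raise the metadata cache's max_size on an open h5py.File.

    Returns the max_size that HDF5 reports after the change.
    """
    config = h5_file.id.get_mdc_config()  # current cache configuration
    config.max_size = max_size_bytes      # only the upper bound is changed
    h5_file.id.set_mdc_config(config)
    return h5_file.id.get_mdc_config().max_size


# Usage (assumed values): call right after opening the file, before
# creating the datasets.
#
#     with h5py.File("test.hdf5", "w") as f:
#         raise_metadata_cache_limit(f, 64 * 1024 * 1024)
```

HDF5 validates the configuration, so a max_size outside the range the library accepts should raise an error rather than being applied silently.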