Reading an HDF5 file created with h5py using Pandas

Question

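I have a bunch of hdf5 files, and I want to convert some of the data in them to parquet files. However, I am struggling to read them with pandas/pyarrow. I think this may have to do with the way the files were originally created.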

If I open the file with h5py, the data looks exactly as I would expect.

import h5py

file_path = "/data/some_file.hdf5"
hdf = h5py.File(file_path, "r")
print(list(hdf.keys()))

gives me

>>> ['foo', 'bar', 'baz']

In this case, I am interested in the group named "bar", which has 3 items in it.

If I try to read the data using HDFStore, I am unable to access any of the groups.

import pandas as pd

file_path = "/data/some_file.hdf5"
store = pd.HDFStore(file_path, "r")

The HDFStore object then has no keys or groups.

assert not store.groups()
assert not store.keys()

If I try to access the data, I get the following error:

bar = store.get("/bar")

TypeError: cannot create a storer if the object is not existing nor a value are passed

If I try to use pd.read_hdf, it looks like the file is empty.

import pandas as pd

file_path = "/data/some_file.hdf"
df = pd.read_hdf(file_path, mode="r")

ValueError: Dataset(s) incompatible with Pandas data types, not table, or no datasets found in HDF5 file.

and

import pandas as pd

file_path = "/data/some_file.hdf5"
pd.read_hdf(file_path, key="/interval", mode="r")

TypeError: cannot create a storer if the object is not existing nor a value are passed

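Based on this answer, I believe the problem is that Pandas expects a very particular hierarchical structure, and the actual hdf5 file is laid out differently.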

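Is there a straightforward way to read an arbitrary hdf5 file into pandas or pytables? I can load the data with h5py if I have to. But the files are large enough that I would like to avoid loading them into memory if possible. So, ideally, I would like to work in pandas and pyarrow as much as I can.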

- Batman

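If the data is loaded into a DataFrame, it is in memory. It looks like you need to read the datasets as numpy arrays and then create a DataFrame from those arrays. Usually pandas uses the arrays without making a further copy. - hpaulj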

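You are correct: Pandas uses a very specific schema (hierarchical structure) to create and read HDF5 files. The Pandas layout is shown in the referenced answer (axis0, axis1, block1_items, etc.). It is a valid HDF5 schema, just not one an average user would produce from NumPy arrays with h5py or PyTables. What do you want to do with the data in "bar"? As @hpaulj said, you can read the data with h5py and load it into a DataFrame. h5py dataset objects "behave like" NumPy arrays but have a small memory footprint. - kcw78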
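A minimal sketch of what the two comments above describe, under stated assumptions: an h5py dataset is only read from disk when it is indexed, so it can be turned into DataFrames one slice at a time. The sample file, the dataset name "x", and the chunk size are invented for illustration; only the group name "bar" comes from the question.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Build a small sample file so the sketch runs on its own.
# The dataset name "x" is hypothetical; "bar" is the group from the question.
path = os.path.join(tempfile.mkdtemp(), "sample.hdf5")
with h5py.File(path, "w") as f:
    f.create_group("bar").create_dataset("x", data=np.arange(10))

chunk_rows = 4  # hypothetical chunk size; tune it to the memory you have
chunk_lengths = []

with h5py.File(path, "r") as f:
    dset = f["bar"]["x"]  # an h5py Dataset handle; no data has been read yet
    n_rows = dset.shape[0]
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        block = dset[start:stop]  # reads only this slice into a NumPy array
        chunk_df = pd.DataFrame({"x": block})
        # Each chunk_df could be processed or written out here
        # before the next slice is read.
        chunk_lengths.append(len(chunk_df))

print(chunk_lengths)
```

Only one slice is held in memory at a time, which is the property the question is after for large files. Whether this is fast enough depends on how the datasets are chunked on disk.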

1 Answer

- NeStack · Accepted Answer

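I had a similar problem before, where I could not read an hdf5 file into a pandas DataFrame. Following this post, I wrote a script that converts the hdf5 file into a dictionary and then the dictionary into a pandas DataFrame, like this: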

import h5py
import pandas as pd

filename = "/data/some_file.hdf5"  # placeholder path; replace with your own file

dictionary = {}
with h5py.File(filename, "r") as f:
    for key in f.keys():
        print(key)

        ds_arr = f[key][()]  # reads the whole dataset into memory as a numpy array
        dictionary[key] = ds_arr  # stores the array in the dict under the key

df = pd.DataFrame.from_dict(dictionary)

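This works as long as each hdf5 key (f.keys()) is simply the name of a column you want in the pandas df, and not a group name. More complex hierarchies can exist in hdf5, but not in pandas. If a group sits in the hierarchy above the keys, e.g. one named data_group, what worked for me as an alternative was to replace f.keys() with f['data_group'].keys() and f[key] with f['data_group'][key].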
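The group variant described above might look like the following sketch. It is not taken from the original answer: the sample file, the dataset names "a" and "b", and the extra check that skips subgroups and non-1-D datasets are all assumptions added for illustration. Only the group name data_group comes from the text.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Build a small sample file so the sketch runs on its own.
# Dataset names "a", "b" and the subgroup "nested" are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "sample.hdf5")
with h5py.File(path, "w") as f:
    grp = f.create_group("data_group")
    grp.create_dataset("a", data=np.array([1, 2, 3]))
    grp.create_dataset("b", data=np.array([0.1, 0.2, 0.3]))
    grp.create_group("nested")  # a subgroup, which cannot become a column

columns = {}
with h5py.File(path, "r") as f:
    group = f["data_group"]
    for key in group.keys():
        item = group[key]
        # Keep only 1-D datasets; skip subgroups and multi-dimensional data.
        if isinstance(item, h5py.Dataset) and item.ndim == 1:
            columns[key] = item[()]  # read the dataset as a NumPy array

df = pd.DataFrame.from_dict(columns)
print(df)
```

The isinstance check is a design choice rather than a requirement: without it, indexing a subgroup with [()] would fail, and arrays of different lengths would still make the DataFrame constructor raise.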