Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes

Question

python pandas hdfstore

7

In Pandas, is there an efficient way to pull out all the MultiIndex index values present in an HDFStore node stored in table format?

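I can select() efficiently using where=, but I want all of the index values and none of the columns. I can also select() with iterator=True to save memory, but that still means reading almost the whole table from disk, so it is still slow.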

I have been looking through store.root..table.*, hoping to find a list of index values there. Am I on the right track?

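Plan B would be to keep a second, much smaller MultiIndex DataFrame that has the same index but no columns, and to append to it every time I append to the main one. I could then retrieve that one and get the index far more cheaply than from the main DataFrame. It is not very elegant, though.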

- Tony

2 Answers

1

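Those working with even larger tables may find that the solution suggested by Jeff ends in a memory error. His is the far more elegant solution, but I could not use it in my case (a table of 200 million rows with a datetime index, on a desktop with 16 GB of RAM). I ended up with the following (unfortunately less elegant) solution, where h5store is the HDFStore object holding a multi-indexed DataFrame saved as a table, and the timestamp index level (Float64) is a CSI index: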

%%time
import numpy as np

# The one-liner below (the select_column approach) ran out of memory on this table:
#ts = h5store.select_column(h5store.keys()[0], column='timestamp').unique()

chunkshape = int(1e7) # rows per chunk; can vary with the machine and HDF5 setup

## get a list with the unique timestamps of each chunk
ts = [indx.index.get_level_values('timestamp').unique()
          for indx in h5store.select(h5store.keys()[0], columns=['timestamp'],
                                     stop=None, # change for a smaller selection
                                     chunksize=chunkshape)
      ]
## drop duplicates at the end-points of adjacent chunks
## (this only removes all duplicates if the timestamps are stored in sorted order)
for i in range(len(ts)-1):
    if ts[i][-1]==ts[i+1][0]:
         ts[i] = ts[i][:-1]
## merge to a single ndarray
ts = np.concatenate(ts)

Time taken for this run (over 200 million rows):

CPU times: user 14min 16s, sys: 2min 34s, total: 16min 50s
Wall time: 14min 45s
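For readers who want to adapt the chunked idea, here is a minimal sketch of a variant that does not rely on the timestamps being sorted: it takes the unique values of each chunk and then de-duplicates the (much smaller) merged result. The names unique_in_chunks and hdf_column_reader are hypothetical helpers, not part of pandas, and the sketch assumes that select_column accepts start and stop row bounds, as it does in recent pandas versions.

```python
import numpy as np
import pandas as pd


def unique_in_chunks(read_chunk, nrows, chunksize=10_000_000):
    """Collect the unique values of one stored column, chunk by chunk.

    read_chunk(start, stop) is a hypothetical callable that returns the rows
    [start, stop) of the column as a Series (or array).
    """
    parts = []
    for start in range(0, nrows, chunksize):
        stop = min(start + chunksize, nrows)
        # Only the unique values of each chunk are kept in memory.
        parts.append(pd.unique(read_chunk(start, stop)))
    if not parts:
        return np.array([])
    # De-duplicate across chunks; values keep their order of first appearance.
    return pd.unique(np.concatenate(parts))


def hdf_column_reader(store, key, column):
    """Hypothetical adapter: read a row range of one column from an open HDFStore."""
    return lambda start, stop: store.select_column(key, column, start=start, stop=stop)
```

With the objects from this answer, a call might look like unique_in_chunks(hdf_column_reader(h5store, key, 'timestamp'), h5store.get_storer(key).nrows), where key is the node name. Whether it is faster than the list comprehension above has not been measured here.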

- eldad-a

- Jeff · Accepted Answer

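Create a multi-index DataFrame. (The session below assumes import pandas as pd, from pandas import DataFrame, MultiIndex, and from numpy.random import randn.)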

In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))

In [36]: df['one'] = 'foo'

In [37]: df['two'] = 'bar'

In [38]: df.loc[50000:,'two'] = 'bah'

In [40]: mi = df.set_index(['one','two'])

In [41]: mi
Out[41]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A    100000  non-null values
B    100000  non-null values
C    100000  non-null values
dtypes: float64(3)

Store it as a table.

In [42]: store = pd.HDFStore('test.h5',mode='w')

In [43]: store.append('df',mi)

The get_storer function returns the stored object (without retrieving the data).

In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']

In [2]: store
Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])

The index levels are created as data columns, which means you can use them in selections. Here is how to select only the index.

In [48]: store.select('df',columns=['one'])
Out[48]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame

Select a single column and return it as a multi-index frame.

In [49]: store.select('df',columns=['A'])
Out[49]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A    100000  non-null values
dtypes: float64(1)

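To select a single column as a Series (this also works for the index levels, since they are stored as columns), use select_column. This will be quite fast.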

In [2]: store.select_column('df','one')
Out[2]: 
0     foo
1     foo
2     foo
3     foo
4     foo
5     foo
6     foo
7     foo
8     foo
9     foo
10    foo
11    foo
12    foo
13    foo
14    foo
...
99985    foo
99986    foo
99987    foo
99988    foo
99989    foo
99990    foo
99991    foo
99992    foo
99993    foo
99994    foo
99995    foo
99996    foo
99997    foo
99998    foo
99999    foo
Length: 100000, dtype: object

If you really want the fastest way to select only the index:

In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop

In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop

Or get the complete index:

In [6]: def f():
   ...:     level_1 =  store.select_column('df','one')
   ...:     level_2 =  store.select_column('df','two')
   ...:     return MultiIndex.from_arrays([ level_1, level_2 ])
   ...: 

In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop
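The function f() above hard-codes the two level names. A minimal sketch of a generalized version follows; read_multiindex is a hypothetical helper name, and it assumes that get_storer(key).levels returns the list of level names, as shown earlier for a multi-index table.

```python
import pandas as pd


def read_multiindex(store, key):
    """Rebuild the full MultiIndex of a table-format node, one level at a time."""
    # For a multi-index table, .levels holds the level names, e.g. ['one', 'two'].
    level_names = store.get_storer(key).levels
    # Each index level is stored as a data column, so it can be read on its own.
    arrays = [store.select_column(key, name) for name in level_names]
    return pd.MultiIndex.from_arrays(arrays, names=level_names)
```

With the store created above, read_multiindex(store, 'df') should return the same index as f(), with the level names attached. Every level is loaded in full, so the memory cost grows with the number of rows.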

If you want the values of each level, a very fast way to get them is:

In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)

In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)