创建一个多索引数据框。
In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))
In [36]: df['one'] = 'foo'
In [37]: df['two'] = 'bar'
In [38]: df.ix[50000:,'two'] = 'bah'
In [40]: mi = df.set_index(['one','two'])
In [41]: mi
Out[41]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 3 columns):
A 100000 non-null values
B 100000 non-null values
C 100000 non-null values
dtypes: float64(3)
把它作为一个表存储
In [42]: store = pd.HDFStore('test.h5',mode='w')
In [43]: store.append('df',mi)
get_storer
函数将返回存储的对象(但不会检索数据)。
In [44]: store.get_storer('df').levels
Out[44]: ['one', 'two']
In [2]: store
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])
索引级别被创建为数据列,这意味着您可以在选择中使用它们。以下是仅选择索引的方法。
In [48]: store.select('df',columns=['one'])
Out[48]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Empty DataFrame
选择单列并将其作为mi-frame返回
In [49]: store.select('df',columns=['A'])
Out[49]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
Data columns (total 1 columns):
A 100000 non-null values
dtypes: float64(1)
为了选择单个列作为Series(也可以作为索引,因为它们存储为列),这将非常快。
In [2]: store.select_column('df','one')
Out[2]:
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
6 foo
7 foo
8 foo
9 foo
10 foo
11 foo
12 foo
13 foo
14 foo
...
99985 foo
99986 foo
99987 foo
99988 foo
99989 foo
99990 foo
99991 foo
99992 foo
99993 foo
99994 foo
99995 foo
99996 foo
99997 foo
99998 foo
99999 foo
Length: 100000, dtype: object
如果您真的想要最快速地选择仅索引
In [4]: %timeit store.select_column('df','one')
100 loops, best of 3: 8.71 ms per loop
In [5]: %timeit store.select('df',columns=['one'])
10 loops, best of 3: 43 ms per loop
或者获取完整的索引
In [6]: def f():
...: level_1 = store.select_column('df','one')
...: level_2 = store.select_column('df','two')
...: return MultiIndex.from_arrays([ level_1, level_2 ])
...:
In [17]: %timeit f()
10 loops, best of 3: 28.1 ms per loop
如果您想获取每个级别的值,一种非常快速的方法是:
In [2]: store.select_column('df','one').unique()
Out[2]: array(['foo'], dtype=object)
In [3]: store.select_column('df','two').unique()
Out[3]: array(['bar', 'bah'], dtype=object)
select_column(...).unique()
也很有用,这实际上会让你得到特定级别的值,不确定这对你是否有用。 - Jeffselect_column(...).unique()
的解决方案,因为它很优雅。但是(至少在我的情况下),对于大表格来说,它会因为内存错误而失败。请看下面我为这种情况提出的临时解决方案——我希望能看到更优雅的解决方案(我不喜欢自己的)。 - eldad-a