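Pandas has a resample method on Series/DataFrame, but there doesn't seem to be a way to resample just a DatetimeIndex on its own?
Specifically, I have a daily DatetimeIndex, possibly with missing dates, and I want to resample it to hourly frequency while including only the dates that appear in the original daily index.
Is there a better way than what I've tried below?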
In [56]: daily_index = pd.period_range('01-Jan-2017', '31-Jan-2017', freq='B').asfreq('D')
In [57]: daily_index
Out[57]:
PeriodIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
'2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11',
'2017-01-12', '2017-01-13', '2017-01-16', '2017-01-17',
'2017-01-18', '2017-01-19', '2017-01-20', '2017-01-23',
'2017-01-24', '2017-01-25', '2017-01-26', '2017-01-27',
'2017-01-30', '2017-01-31'],
dtype='int64', freq='D')
In [58]: daily_index.shape
Out[58]: (22,)
In [59]: hourly_index = pd.DatetimeIndex([]).union_many(
    ...:     pd.date_range(day.to_timestamp('H','S'), day.to_timestamp('H','E'), freq='H')
    ...:     for day in daily_index
    ...: )
In [60]: hourly_index
Out[60]:
DatetimeIndex(['2017-01-02 00:00:00', '2017-01-02 01:00:00',
'2017-01-02 02:00:00', '2017-01-02 03:00:00',
'2017-01-02 04:00:00', '2017-01-02 05:00:00',
'2017-01-02 06:00:00', '2017-01-02 07:00:00',
'2017-01-02 08:00:00', '2017-01-02 09:00:00',
...
'2017-01-31 14:00:00', '2017-01-31 15:00:00',
'2017-01-31 16:00:00', '2017-01-31 17:00:00',
'2017-01-31 18:00:00', '2017-01-31 19:00:00',
'2017-01-31 20:00:00', '2017-01-31 21:00:00',
'2017-01-31 22:00:00', '2017-01-31 23:00:00'],
dtype='datetime64[ns]', length=528, freq=None)
In [61]: 22*24
Out[61]: 528
In [62]: %%timeit
    ...: hourly_index = pd.DatetimeIndex([]).union_many(
    ...:     pd.date_range(day.to_timestamp('H','S'), day.to_timestamp('H','E'), freq='H')
    ...:     for day in daily_index
    ...: )
100 loops, best of 3: 13.7 ms per loop
Update:
I went with a slight variant of @NTAWolf's answer. It has similar performance but does not reorder the input dates in case they are not sorted.
def resample_index(index, freq):
    """Resamples each day in the daily `index` to the specified `freq`.

    Parameters
    ----------
    index : pd.DatetimeIndex
        The daily-frequency index to resample
    freq : str
        A pandas frequency string which should be higher than daily

    Returns
    -------
    pd.DatetimeIndex
        The resampled index

    """
    assert isinstance(index, pd.DatetimeIndex)
    start_date = index.min()
    end_date = index.max() + pd.DateOffset(days=1)
    resampled_index = pd.date_range(start_date, end_date, freq=freq)[:-1]
    series = pd.Series(resampled_index, resampled_index.floor('D'))
    return pd.DatetimeIndex(series.loc[index].values)
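To illustrate the point about input order, here is a small sketch that feeds the function an unsorted daily index. The function body is repeated in condensed form only so the snippet runs on its own; the sample dates and the variable names `unsorted_days` and `hourly` are made up for illustration, and the lowercase `"h"` alias is an assumption chosen because newer pandas releases deprecate the uppercase `"H"` spelling.

```python
import pandas as pd


def resample_index(index, freq):
    # Same logic as above: build one continuous range, key it by day,
    # then look the requested days up in their original order.
    start_date = index.min()
    end_date = index.max() + pd.DateOffset(days=1)
    resampled_index = pd.date_range(start_date, end_date, freq=freq)[:-1]
    series = pd.Series(resampled_index, resampled_index.floor("D"))
    return pd.DatetimeIndex(series.loc[index].values)


# Hypothetical input: three days, deliberately out of order, with a gap.
unsorted_days = pd.DatetimeIndex(["2017-01-05", "2017-01-02", "2017-01-03"])
hourly = resample_index(unsorted_days, "h")

# The day blocks come back in the order they were passed in,
# and the missing day (2017-01-04) is not filled in.
print(hourly.normalize().unique())
print(len(hourly))
```

The lookup via `series.loc[index]` is what keeps the caller's ordering: each requested day pulls out its own block of timestamps, so no sorting step is involved.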
In [184]: %%timeit
     ...: hourly_index3 = pd.date_range(daily_index.start_time.min(),
     ...:                               daily_index.end_time.max() + 1,
     ...:                               normalize=True, freq='H')
     ...: hourly_index3 = hourly_index3[hourly_index3.floor('D').isin(daily_index.start_time)]
100 loops, best of 3: 2.97 ms per loop
In [185]: %timeit resample_index(daily_index.to_timestamp('D','S'), freq='H')
100 loops, best of 3: 2.93 ms per loop
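Another possible variant, which is not from the original thread and is not timed here, skips the intermediate full-range index altogether and adds a block of hourly offsets to each day using NumPy broadcasting. This is only a sketch: the helper name `expand_days_hourly` is hypothetical, and it assumes the input is a timezone-naive DatetimeIndex whose values sit at midnight.

```python
import numpy as np
import pandas as pd


def expand_days_hourly(days):
    """Expand each midnight timestamp in `days` into its 24 hourly stamps.

    Hypothetical helper; assumes a timezone-naive pd.DatetimeIndex.
    """
    # Offsets 0h, 1h, ..., 23h as a timedelta64 array.
    offsets = np.arange(24) * np.timedelta64(1, "h")
    # Broadcasting (n_days, 1) + (1, 24) gives an (n_days, 24) grid;
    # ravel() flattens it day by day, preserving the input order.
    grid = days.values[:, None] + offsets[None, :]
    return pd.DatetimeIndex(grid.ravel())


# Business days in January 2017, matching the 22 days in the question.
daily = pd.bdate_range("2017-01-01", "2017-01-31")
hourly = expand_days_hourly(daily)
print(len(hourly))
```

Like `resample_index`, this keeps the days in whatever order they were supplied, since nothing is sorted or filtered.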
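Bonus points for floor - I'm fairly familiar with the pandas API, but that one was new to me! - Dave Hirschfeld
Is hourly_index3.date faster than hourly_index3.floor('D')? - IanS
hourly_index3.date can't be dropped in directly, because it is a numpy array and has no .isin. It also has dtype=object, which is not very efficient for our purposes. Hence the modification seen in Update 2 above. - thorbjornwolf
pd.DateOffset(days=1) is nice too; that's new to me! It feels a lot safer than just using +1 and hoping the dtype is right. - thorbjornwolf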
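The points raised in these comments can be checked with a short sketch. The sample index and variable names below are made up for illustration, and the lowercase `"h"` alias is an assumption made to stay compatible with newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two full days of hourly timestamps.
hourly = pd.date_range("2017-01-02", periods=48, freq="h")

# .date returns a plain numpy array of datetime.date objects ...
as_dates = hourly.date
print(type(as_dates), as_dates.dtype)

# ... while .floor('D') stays a DatetimeIndex, so .isin is available.
floored = hourly.floor("D")
mask = floored.isin(pd.DatetimeIndex(["2017-01-03"]))
print(mask.sum())

# pd.DateOffset(days=1) states the unit explicitly instead of relying on
# what a bare integer means for the index type.
next_day = pd.Timestamp("2017-01-31") + pd.DateOffset(days=1)
print(next_day)
```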