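If the values are already datetime64 (which they should be), you are better off not comparing them as strings: the string form has to be computed before each comparison, and string operations are slow. Comparing the datetime values directly is more efficient.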
In [11]: s = pd.Series(pd.to_datetime(['2014-02-21 17:16:42', '2014-02-22 17:16:42']))
In [12]: s
Out[12]:
0 2014-02-21 17:16:42
1 2014-02-22 17:16:42
dtype: datetime64[ns]
You can do a simple ordering check:
In [13]: (pd.Timestamp('2014-02-21') < s) & (s < pd.Timestamp('2014-02-22'))
Out[13]:
0 True
1 False
dtype: bool
In [14]: s.loc[(pd.Timestamp('2014-02-21') < s) & (s < pd.Timestamp('2014-02-22'))]
Out[14]:
0 2014-02-21 17:16:42
dtype: datetime64[ns]
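Note that the strict `<` on the lower bound excludes a timestamp that falls exactly at midnight on 2014-02-21. If that row should count as part of the day, a half-open interval is one option. The following is a minimal sketch, not part of the original session; the extra midnight value and the names `start`, `end`, and `mask` are illustrative assumptions.

```python
import pandas as pd

# Same data as above, plus one value exactly at midnight (added for illustration)
s = pd.Series(pd.to_datetime(['2014-02-21 00:00:00',
                              '2014-02-21 17:16:42',
                              '2014-02-22 17:16:42']))

start = pd.Timestamp('2014-02-21')
end = start + pd.Timedelta(days=1)

# Half-open interval [start, end): keeps midnight of the 21st,
# drops midnight of the 22nd and anything later
mask = (s >= start) & (s < end)
selected = s.loc[mask]
print(selected)
```

With this mask the first two rows are selected and the row from 2014-02-22 is dropped.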
However, it is faster to use DatetimeIndex.normalize (which gives the midnight Timestamp for each Timestamp):
In [15]: pd.DatetimeIndex(s).normalize()
Out[15]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-02-21, 2014-02-22]
Length: 2, Freq: None, Timezone: None
In [16]: pd.DatetimeIndex(s).normalize() == pd.Timestamp('2014-02-21')
Out[16]: array([ True, False], dtype=bool)
In [17]: s.loc[pd.DatetimeIndex(s).normalize() == pd.Timestamp('2014-02-21')]
Out[17]:
0 2014-02-21 17:16:42
dtype: datetime64[ns]
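On pandas versions that provide the `.dt` accessor on a Series (an assumption about your installed version; the session above predates it), the same normalization can be written without wrapping the Series in a `DatetimeIndex`. A minimal sketch, with `day` and `mask` as illustrative names:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2014-02-21 17:16:42', '2014-02-22 17:16:42']))

day = pd.Timestamp('2014-02-21')

# .dt.normalize() sets the time part of every element to midnight
# and returns a Series aligned with the index of s
mask = s.dt.normalize() == day
selected = s.loc[mask]
print(selected)
```

Because the mask is a Series that shares the index of `s`, it can be passed straight to `.loc`.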
Here are some timings (using s from above):
In [21]: %timeit s.loc[s.str.startswith('2014-02-21')]
1000 loops, best of 3: 1.16 ms per loop
In [22]: %timeit s.loc[(pd.Timestamp('2014-02-21') < s) & (s < pd.Timestamp('2014-02-22'))]
1000 loops, best of 3: 1.23 ms per loop
In [23]: %timeit s.loc[pd.DatetimeIndex(s).normalize() == pd.Timestamp('2014-02-21')]
1000 loops, best of 3: 405 µs per loop
With a slightly larger s, the difference is more telling:
In [31]: s = pd.Series(pd.to_datetime(['2014-02-21 17:16:42', '2014-02-22 17:16:42'] * 1000))
In [32]: %timeit s.loc[s.str.startswith('2014-02-21')]
10 loops, best of 3: 105 ms per loop
In [33]: %timeit s.loc[(pd.Timestamp('2014-02-21') < s) & (s < pd.Timestamp('2014-02-22'))]
1000 loops, best of 3: 1.3 ms per loop
In [34]: %timeit s.loc[pd.DatetimeIndex(s).normalize() == pd.Timestamp('2014-02-21')]
1000 loops, best of 3: 694 µs per loop
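To repeat the comparison outside IPython, something like the following sketch could be used. It assumes the string approach runs on a string copy of the data (`s_str`, a name introduced here), and it uses the standard-library `timeit` module instead of `%timeit`. The numbers will differ by machine and pandas version, so no expected timings are shown.

```python
import timeit
import pandas as pd

s = pd.Series(pd.to_datetime(['2014-02-21 17:16:42', '2014-02-22 17:16:42'] * 1000))
s_str = s.astype(str)  # string copy, standing in for an unparsed column (assumption)
day = pd.Timestamp('2014-02-21')
next_day = pd.Timestamp('2014-02-22')

candidates = {
    'str.startswith': lambda: s.loc[s_str.str.startswith('2014-02-21')],
    'range comparison': lambda: s.loc[(day < s) & (s < next_day)],
    'normalize': lambda: s.loc[pd.DatetimeIndex(s).normalize() == day],
}

# Run each approach once so the selections can be compared
results = {name: fn() for name, fn in candidates.items()}

# Time each approach over a small number of calls
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20)
    print(f'{name}: {seconds / 20 * 1e6:.0f} µs per call')
```

For this data all three approaches select the same 1000 rows, so only the timings differ.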
Note: in your example the column df['date_time'] contains strings, so you would need to do: df.loc[pd.DatetimeIndex(df['date_time']) == ...].
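If the column is needed as datetimes more than once, it may be simpler to parse it a single time and filter afterwards. A minimal sketch under stated assumptions: only the column name `date_time` comes from the question, while the frame contents and the `value` column are hypothetical.

```python
import pandas as pd

# Hypothetical frame; only the column name 'date_time' is taken from the question
df = pd.DataFrame({
    'date_time': ['2014-02-21 17:16:42', '2014-02-22 17:16:42'],
    'value': [1, 2],
})

# Parse the strings once so later comparisons operate on datetime64 values
df['date_time'] = pd.to_datetime(df['date_time'])

day = pd.Timestamp('2014-02-21')

# Keep the rows whose timestamp falls on the chosen day
result = df.loc[df['date_time'].dt.normalize() == day]
print(result)
```

Parsing up front costs one pass over the column, after which every filter avoids string work.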