Merging data frames by closest time with pandas

11

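I have two data frames (logs and failures) that I would like to merge so that logs gets an extra column holding the closest date found in failures.

Below is the code that generates logs, failures, and the desired output: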

import pandas as pd
logs=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4])})
logs['date-time']=pd.to_datetime(logs['date-time'], dayfirst=True)
failures=pd.DataFrame({'date':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00']),'failure':pd.Series([1,1,1])})
failures['date']=pd.to_datetime(failures['date'], dayfirst=True)
output=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11']),'var1':pd.Series([0,1,3,1,2,4]),'closest_failure':pd.Series(['23/10/2015 00:00:00','22/10/2015 00:00:00','21/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00','23/10/2015 00:00:00'])})
output['date-time']=pd.to_datetime(output['date-time'], dayfirst=True)

Any ideas? The real dataset is very large, so efficiency also matters.

2 Answers

19
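With pandas >= 0.19.0, you can use pandas.merge_asof to get approximate matching. In 0.19, you can only get the closest failures value that falls at or before the logs value. In 0.20, however, you can get the closest value in either direction.

Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys.

For each row in the left DataFrame, we select the last row in the right DataFrame whose "on" key is less than or equal to the left's key. Both DataFrames must be sorted by the key.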
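To make the matching rule above concrete, here is a minimal sketch on small integer keys. The frames `left` and `right` and the column `right_key` are made up for illustration; the `direction` argument is only available from pandas 0.20 onward.

```python
import pandas as pd

# Hypothetical toy frames; both are already sorted by "key".
left = pd.DataFrame({"key": [1, 5, 10]})
right = pd.DataFrame({"key": [2, 6], "right_key": [2, 6]})

# Default direction="backward": last right key <= left key.
backward = pd.merge_asof(left, right, on="key")

# direction="forward": first right key >= left key.
forward = pd.merge_asof(left, right, on="key", direction="forward")

# direction="nearest": right key with the smallest absolute distance.
nearest = pd.merge_asof(left, right, on="key", direction="nearest")

print(backward)  # key 1 has no match at or before it, so right_key is NaN there
print(forward)   # key 10 has no match at or after it, so right_key is NaN there
print(nearest)   # every left key gets a match
```

Rows without a match in the chosen direction are kept, as in a left join, with missing values on the right-hand columns.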

In [3]: failures.sort_values("date", inplace=True)

In [6]: logs2=pd.DataFrame({'date-time':pd.Series(['23/10/2015 10:20:54','22/10/2015 09:51:32','21/10/2015 06:51:32','28/10/2015 16:59:32','25/10/2015 04:41:32','24/10/2015 11:50:11', "20/10/2015 01:02:03"]),'var1':pd.Series([0,1,3,1,2,4, 99])})

In [7]: logs2['date-time']=pd.to_datetime(logs2['date-time'])

In [8]: logs2.sort_values("date-time", inplace=True)

In [9]: logs2
Out[9]: 
            date-time  var1
6 2015-10-20 01:02:03    99
2 2015-10-21 06:51:32     3
1 2015-10-22 09:51:32     1
0 2015-10-23 10:20:54     0
5 2015-10-24 11:50:11     4
4 2015-10-25 04:41:32     2
3 2015-10-28 16:59:32     1

In [10]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date")
Out[10]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99        NaT      NaN
1 2015-10-21 06:51:32     3 2015-10-21      1.0
2 2015-10-22 09:51:32     1 2015-10-22      1.0
3 2015-10-23 10:20:54     0 2015-10-23      1.0
4 2015-10-24 11:50:11     4 2015-10-23      1.0
5 2015-10-25 04:41:32     2 2015-10-23      1.0
6 2015-10-28 16:59:32     1 2015-10-23      1.0

In [11]: pd.merge_asof(logs2, failures, left_on="date-time", right_on="date", direction="nearest")
Out[11]: 
            date-time  var1       date  failure
0 2015-10-20 01:02:03    99 2015-10-21        1
1 2015-10-21 06:51:32     3 2015-10-21        1
2 2015-10-22 09:51:32     1 2015-10-22        1
3 2015-10-23 10:20:54     0 2015-10-23        1
4 2015-10-24 11:50:11     4 2015-10-23        1
5 2015-10-25 04:41:32     2 2015-10-23        1
6 2015-10-28 16:59:32     1 2015-10-23        1
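Because merge_asof needs sorted input, the result comes back in time order rather than in the original row order of logs. The following is one possible sketch, not part of the original answer, for getting the `closest_failure` column from the question while restoring the original order. It assumes pandas >= 0.20 for `direction="nearest"`; the helper column name `index` comes from `reset_index()`.

```python
import pandas as pd

logs = pd.DataFrame({
    "date-time": pd.to_datetime(
        ["23/10/2015 10:20:54", "22/10/2015 09:51:32", "21/10/2015 06:51:32",
         "28/10/2015 16:59:32", "25/10/2015 04:41:32", "24/10/2015 11:50:11"],
        dayfirst=True),
    "var1": [0, 1, 3, 1, 2, 4],
})
failures = pd.DataFrame({
    "date": pd.to_datetime(
        ["23/10/2015 00:00:00", "22/10/2015 00:00:00", "21/10/2015 00:00:00"],
        dayfirst=True),
    "failure": [1, 1, 1],
})

# Keep the original row labels in a column ("index") before sorting by time.
left = logs.reset_index().sort_values("date-time")

# Only the date is needed; rename it to the column name asked for in the question.
right = (failures[["date"]]
         .sort_values("date")
         .rename(columns={"date": "closest_failure"}))

merged = pd.merge_asof(left, right, left_on="date-time",
                       right_on="closest_failure", direction="nearest")

# Restore the original row order of logs.
result = merged.set_index("index").sort_index()
result.index.name = None
print(result)
```

Sorting once and merging avoids a per-row search, which should matter for the large dataset mentioned in the question.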

5
You can use reindex with method="nearest". There may be a neater way, but using a Series with the failure dates in both the index and the values works:
In [11]: failures_dt = pd.Series(failures["date"].values, failures["date"])

In [12]: failures_dt.reindex(logs["date-time"], method="nearest")
Out[12]:
date-time
2015-10-23 10:20:54   2015-10-23
2015-10-22 09:51:32   2015-10-22
2015-10-21 06:51:32   2015-10-21
2015-10-28 16:59:32   2015-10-23
2015-10-25 04:41:32   2015-10-23
2015-10-24 11:50:11   2015-10-23
dtype: datetime64[ns]

In [13]: logs["nearest"] = failures_dt.reindex(logs["date-time"], method="nearest").values

In [14]: logs
Out[14]:
            date-time  var1    nearest
0 2015-10-23 10:20:54     0 2015-10-23
1 2015-10-22 09:51:32     1 2015-10-22
2 2015-10-21 06:51:32     3 2015-10-21
3 2015-10-28 16:59:32     1 2015-10-23
4 2015-10-25 04:41:32     2 2015-10-23
5 2015-10-24 11:50:11     4 2015-10-23
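One caveat with this approach: reindex with a `method` requires the index being searched to be unique and monotonic. The example above happens to satisfy that, but real failure data may not. A minimal sketch of a wrapper that guards against this follows; the function name `nearest_failure` and the sample data are hypothetical.

```python
import pandas as pd

def nearest_failure(log_times: pd.Series, failure_times: pd.Series) -> pd.Series:
    """For each log timestamp, return the nearest failure timestamp."""
    # reindex(method=...) needs a unique, monotonic index: dedupe and sort first.
    unique_sorted = failure_times.drop_duplicates().sort_values()
    lookup = pd.Series(unique_sorted.to_numpy(), index=unique_sorted.to_numpy())
    nearest = lookup.reindex(log_times, method="nearest")
    # Re-attach the caller's index so the result aligns with the logs frame.
    return pd.Series(nearest.to_numpy(), index=log_times.index, name="nearest")

# Hypothetical sample: unsorted failures containing a duplicate date.
log_times = pd.Series(pd.to_datetime(
    ["2015-10-23 10:20:54", "2015-10-28 16:59:32", "2015-10-21 06:51:32"]))
failure_times = pd.Series(pd.to_datetime(
    ["2015-10-22 00:00:00", "2015-10-21 00:00:00",
     "2015-10-23 00:00:00", "2015-10-21 00:00:00"]))

closest = nearest_failure(log_times, failure_times)
print(closest)
```

The returned Series shares the index of `log_times`, so it can be assigned directly as a new column on the logs frame.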
