Most efficient way to determine time series overlap in Python

I am trying to use Python's pandas library to determine what percentage of the time two time series overlap in value. The data is nonsynchronous, so the timestamps of the data points do not line up. Here is an example:
Time series 1:
2016-10-05 11:50:02.000734    0.50
2016-10-05 11:50:03.000033    0.25
2016-10-05 11:50:10.000479    0.50
2016-10-05 11:50:15.000234    0.25
2016-10-05 11:50:37.000199    0.50
2016-10-05 11:50:49.000401    0.50
2016-10-05 11:50:51.000362    0.25
2016-10-05 11:50:53.000424    0.75
2016-10-05 11:50:53.000982    0.25
2016-10-05 11:50:58.000606    0.75

Time series 2:

2016-10-05 11:50:07.000537    0.50
2016-10-05 11:50:11.000994    0.50
2016-10-05 11:50:19.000181    0.50
2016-10-05 11:50:35.000578    0.50
2016-10-05 11:50:46.000761    0.50
2016-10-05 11:50:49.000295    0.75
2016-10-05 11:50:51.000835    0.75
2016-10-05 11:50:55.000792    0.25
2016-10-05 11:50:55.000904    0.75
2016-10-05 11:50:57.000444    0.75

Assuming each series holds its value until the next change, what is the most efficient way to determine the percentage of time that they have the same value? Let's compute the overlap from 11:50:07.000537 to 2016-10-05 11:50:57.000444, since both series have data over that period. The times during which they hold the same value are:
11:50:10.000479 - 11:50:15.000234 (both values 0.50): 4.999755 seconds
11:50:37.000199 - 11:50:49.000295 (both values 0.50): 12.000096 seconds
11:50:53.000424 - 11:50:53.000982 (both values 0.75): 0.000558 seconds
11:50:55.000792 - 11:50:55.000904 (both values 0.25): 0.000112 seconds
The result is (4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 = 34%.
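
For reference, the hand calculation above can be checked directly with pd.Timestamp arithmetic. A minimal sketch using the endpoints listed above:

import pandas as pd

# Intervals during which both series hold the same value (from the list above)
intervals = [
    ("2016-10-05 11:50:10.000479", "2016-10-05 11:50:15.000234"),  # both 0.50
    ("2016-10-05 11:50:37.000199", "2016-10-05 11:50:49.000295"),  # both 0.50
    ("2016-10-05 11:50:53.000424", "2016-10-05 11:50:53.000982"),  # both 0.75
    ("2016-10-05 11:50:55.000792", "2016-10-05 11:50:55.000904"),  # both 0.25
]
shared = sum((pd.Timestamp(b) - pd.Timestamp(a)).total_seconds() for a, b in intervals)
total = (pd.Timestamp("2016-10-05 11:50:57.000444")
         - pd.Timestamp("2016-10-05 11:50:07.000537")).total_seconds()
print(shared, total, shared / total)   # approximately 17.000521, 49.999907, 0.340011
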
My actual time series contain far more data, on the order of 1,000-10,000 observations, and I need to run many such comparisons. I considered forward-filling one series and then simply comparing rows, dividing the number of matches by the total number of rows, but I suspect that would not be very efficient.
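
A rough sketch of that row-comparison idea (using the s1/s2 Series constructed in the first answer below; note that it weights every observation equally rather than by elapsed time, so it only approximates the duration-based figure above):

# Forward-fill s2 onto s1's timestamps, then count the rows where the values agree
aligned = s2.reindex(s1.index.union(s2.index)).ffill().reindex(s1.index)
match_fraction = (s1 == aligned).mean()
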
2 Answers


Setup
Create the two time series

from io import StringIO
import pandas as pd


txt1 = """2016-10-05 11:50:02.000734    0.50
2016-10-05 11:50:03.000033    0.25
2016-10-05 11:50:10.000479    0.50
2016-10-05 11:50:15.000234    0.25
2016-10-05 11:50:37.000199    0.50
2016-10-05 11:50:49.000401    0.50
2016-10-05 11:50:51.000362    0.25
2016-10-05 11:50:53.000424    0.75
2016-10-05 11:50:53.000982    0.25
2016-10-05 11:50:58.000606    0.75"""

# parse the text into a Series with a DatetimeIndex
s1 = (pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                  parse_dates=[0], index_col=0, header=None)
        .squeeze('columns').rename('s1').rename_axis(None))

txt2 = """2016-10-05 11:50:07.000537    0.50
2016-10-05 11:50:11.000994    0.50
2016-10-05 11:50:19.000181    0.50
2016-10-05 11:50:35.000578    0.50
2016-10-05 11:50:46.000761    0.50
2016-10-05 11:50:49.000295    0.75
2016-10-05 11:50:51.000835    0.75
2016-10-05 11:50:55.000792    0.25
2016-10-05 11:50:55.000904    0.75
2016-10-05 11:50:57.000444    0.75"""

s2 = (pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                  parse_dates=[0], index_col=0, header=None)
        .squeeze('columns').rename('s2').rename_axis(None))

TL;DR
df = pd.concat([s1, s2], axis=1).ffill().dropna()
overlap = df.index.to_series().diff().shift(-1) \
            .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
overlap.div(overlap.sum())

False    0.666657
True     0.333343
Name: duration, dtype: float64

Explanation

Build the base pd.DataFrame df

  • Use pd.concat to align the two indices
  • Use ffill to propagate values forward
  • Use dropna to get rid of the rows before the other series has started

df = pd.concat([s1, s2], axis=1).ffill().dropna()
df

(screenshot of the resulting DataFrame df)

Compute 'duration'
from each timestamp to the next

df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
df

(screenshot of df with the new duration column)

Compute the overlap

  • df.s1.eq(df.s2) produces a boolean series indicating where s1 and s2 have the same value.
  • groupby that boolean series and sum the total duration for True and for False.

overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
overlap

False   00:00:33.999548
True    00:00:17.000521
Name: duration, dtype: timedelta64[ns]

Percentage of time with the same value:
overlap.div(overlap.sum())

False    0.666657
True     0.333343
Name: duration, dtype: float64
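
Note that this works out to 33.3% rather than the question's 34%: the window here runs through s1's final timestamp, 11:50:58.000606 (51.000069 seconds in total), instead of stopping at 11:50:57.000444 (49.999907 seconds). Truncating df at the question's end time reproduces the 34% figure; a minimal sketch along the same lines:

dfq = df.loc[:'2016-10-05 11:50:57.000444']          # stop at the question's end time
dur = dfq.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
shared = dur.groupby(dfq.s1.eq(dfq.s2)).sum()
shared.div(shared.sum())                             # True is approximately 0.340011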

Nice! A cosmetic question: any reason to use .eq here rather than ==? - Zeugma
@Boud Coding style. I don't like wrapping the whole expression in parentheses when I want to chain something else onto it. I think it's even faster in some cases; I've tested that. - piRSquared
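
As a small illustration of the chaining point (using the df built above; both spellings produce the same boolean series):

same = df.s1.eq(df.s2).rename('same')         # method form chains directly
same_alt = (df.s1 == df.s2).rename('same')    # operator form needs the wrapping parentheses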

Great question. I brute-forced it without any pandas or numpy vectorization, and I get the same answer you did (thanks for working the example through). I haven't tested it on anything else, and I don't know how fast it is: it makes only a single pass through each dataframe, but does no vectorized work.
import pandas as pd
#############################################################################
#Preparing the dataframes
times_1 = ["2016-10-05 11:50:02.000734","2016-10-05 11:50:03.000033",
           "2016-10-05 11:50:10.000479","2016-10-05 11:50:15.000234",
           "2016-10-05 11:50:37.000199","2016-10-05 11:50:49.000401",
           "2016-10-05 11:50:51.000362","2016-10-05 11:50:53.000424",
           "2016-10-05 11:50:53.000982","2016-10-05 11:50:58.000606"]
times_1 = [pd.Timestamp(t) for t in times_1]
vals_1 = [0.50,0.25,0.50,0.25,0.50,0.50,0.25,0.75,0.25,0.75]

times_2 = ["2016-10-05 11:50:07.000537","2016-10-05 11:50:11.000994",
           "2016-10-05 11:50:19.000181","2016-10-05 11:50:35.000578",
           "2016-10-05 11:50:46.000761","2016-10-05 11:50:49.000295",
           "2016-10-05 11:50:51.000835","2016-10-05 11:50:55.000792",
           "2016-10-05 11:50:55.000904","2016-10-05 11:50:57.000444"]
times_2 = [pd.Timestamp(t) for t in times_2]
vals_2 = [0.50,0.50,0.50,0.50,0.50,0.75,0.75,0.25,0.75,0.75]

data_1 = pd.DataFrame({"time":times_1,"vals":vals_1})
data_2 = pd.DataFrame({"time":times_2,"vals":vals_2})
#############################################################################

shared_time = 0      #Keep running tally of shared time
t1_ind = 0           #Pointer to row in data_1 dataframe
t2_ind = 0           #Pointer to row in data_2 dataframe

#Loop through both dataframes once, advancing either the t1 or the t2 pointer
#Stop one short of the end of each, since the loop indexes [ind+1]
while t1_ind < len(data_1.time)-1 and t2_ind < len(data_2.time)-1:
    #Get val1 and val2
    val1,val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]

    #Get the start and stop of the current time window
    t1_start,t1_stop = data_1.time[t1_ind], data_1.time[t1_ind+1]
    t2_start,t2_stop = data_2.time[t2_ind], data_2.time[t2_ind+1]

    #Values match and time window 2 starts inside time window 1
    if val1 == val2 and (t1_start <= t2_start <= t1_stop):
        shared_time += (min(t1_stop,t2_stop)-t2_start).total_seconds()
        t1_ind += 1
    #Values match and time window 1 starts inside time window 2
    elif val1 == val2 and t2_start <= t1_start <= t2_stop:
        shared_time += (min(t1_stop,t2_stop)-t1_start).total_seconds()
        t2_ind += 1
    #No matching overlap and time window 1 starts earlier: advance window 1
    elif t1_start < t2_start:
        t1_ind += 1
    #No matching overlap and time window 2 starts at or before window 1: advance window 2
    else:
        t2_ind += 1

#How I calculated the maximum possible shared time (not pretty)
shared_start = max(data_1.time[0],data_2.time[0])
shared_stop = min(data_1.time.iloc[-1],data_2.time.iloc[-1])
max_possible_shared = (shared_stop-shared_start).total_seconds()

#Print output
print "Shared time:",shared_time
print "Total possible shared:",max_possible_shared
print "Percent shared:",shared_time*100/max_possible_shared,"%"

Output:

Shared time: 17.000521
Total possible shared: 49.999907
Percent shared: 34.0011052421 %

Cool, this is what I was looking for, but I was hoping for a faster solution. I did need to change the index-update logic for the val1 != val2 case, since you can't assume you only ever move the first index. - klib
@klib Yeah, I also figured there was probably a pandas or numpy solution. You're absolutely right about not assuming the first index; I've edited my answer. - mitoRibo
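
For completeness, here is a minimal sketch of the "advance whichever window ends first" two-pointer rule the comments are getting at, written against the data_1/data_2 DataFrames from the answer above (an illustration, not the answerer's edited code):

def shared_seconds(data_1, data_2):
    #Total seconds during which both step functions hold the same value
    shared = 0.0
    i = j = 0
    while i < len(data_1) - 1 and j < len(data_2) - 1:
        a_start, a_stop = data_1.time[i], data_1.time[i + 1]
        b_start, b_stop = data_2.time[j], data_2.time[j + 1]
        lo, hi = max(a_start, b_start), min(a_stop, b_stop)
        if hi > lo and data_1.vals[i] == data_2.vals[j]:
            shared += (hi - lo).total_seconds()
        #Advance whichever window ends first; its interval is fully processed
        if a_stop <= b_stop:
            i += 1
        else:
            j += 1
    return shared

print(shared_seconds(data_1, data_2))   # approximately 17.000521 for the example data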
