Find date range overlap in Python

18

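I am trying to find a more efficient way of detecting overlapping date ranges (each row provides a start and an end date) in a dataframe, grouped on a specific column (id). The dataframe is already sorted on the "from" column. I think there is a way to avoid the double apply that I am currently using: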

import pandas as pd
from datetime import datetime

df = pd.DataFrame(columns=['id','from','to'], index=range(5), \
                  data=[[878,'2006-01-01','2007-10-01'],
                        [878,'2007-10-02','2008-12-01'],
                        [878,'2008-12-02','2010-04-03'],
                        [879,'2010-04-04','2199-05-11'],
                        [879,'2016-05-12','2199-12-31']])

df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])


    id  from        to
0   878 2006-01-01  2007-10-01
1   878 2007-10-02  2008-12-01
2   878 2008-12-02  2010-04-03
3   879 2010-04-04  2199-05-11
4   879 2016-05-12  2199-12-31

I use an "apply" function to loop over all the groups, and within each group I use another "apply" to process each row:

def check_date_by_id(df):
    
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()
    
    def check_date_by_row(x):
        
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x
    
    return df.apply(check_date_by_row, axis=1).drop(['prevFrom','prevTo'], axis=1)

df.groupby('id').apply(check_date_by_id)

    id  from        to          overlap
0   878 2006-01-01  2007-10-01  False
1   878 2007-10-02  2008-12-01  False
2   878 2008-12-02  2010-04-03  False
3   879 2010-04-04  2199-05-11  False
4   879 2016-05-12  2199-12-31  True
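The row-level test in check_date_by_row boils down to comparing the later of the two start dates with the earlier of the two end dates. A minimal sketch of that test on plain dates, assuming whole-day values and treating ranges that share a day as overlapping (the helper name ranges_overlap is made up for illustration):

```python
from datetime import date


def ranges_overlap(start_a, end_a, start_b, end_b):
    """Return True if two closed date ranges share at least one day."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    # The ranges intersect exactly when the later start is not after the earlier end
    return latest_start <= earliest_end


# Rows 0 and 1 of the sample data: adjacent, but no shared day
adjacent = ranges_overlap(date(2006, 1, 1), date(2007, 10, 1),
                          date(2007, 10, 2), date(2008, 12, 1))

# Rows 3 and 4 of the sample data: the second range starts inside the first
nested = ranges_overlap(date(2010, 4, 4), date(2199, 5, 11),
                        date(2016, 5, 12), date(2199, 12, 31))
```

For whole-day values this is the same condition as `(earliest_end - latest_start).days + 1 > 0` in the code above.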

My code was inspired by the following link:

5 Answers

15
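You can shift the "to" column down by one row and subtract the datetimes directly (timedelta comes from the standard datetime module).
from datetime import timedelta

df['overlap'] = (df['to'].shift() - df['from']) > timedelta(0)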

Applying this while grouping by id might look like this:
df['overlap'] = (df.groupby('id')
                   .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
                   .reset_index(level=0, drop=True))

Demo

>>> df
    id       from         to
0  878 2006-01-01 2007-10-01
1  878 2007-10-02 2008-12-01
2  878 2008-12-02 2010-04-03
3  879 2010-04-04 2199-05-11
4  879 2016-05-12 2199-12-31

>>> df['overlap'] = (df.groupby('id')
                       .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
                       .reset_index(level=0, drop=True))

>>> df
    id       from         to overlap
0  878 2006-01-01 2007-10-01   False
1  878 2007-10-02 2008-12-01   False
2  878 2008-12-02 2010-04-03   False
3  879 2010-04-04 2199-05-11   False
4  879 2016-05-12 2199-12-31    True
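The same per-id check can probably be written without apply at all, since a grouped shift already returns a Series aligned with the original index. This is a sketch on the question's sample data, not part of the original answer; like the answer, it uses a strict comparison, so a range that starts on the very day the previous one ends is not flagged:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})

# End date of the previous row within the same id; the first row of each id gets NaT
prev_to = df.groupby('id')['to'].shift()

# Comparisons against NaT evaluate to False, so group-leading rows are never flagged
df['overlap'] = prev_to > df['from']
```

Use `>=` instead of `>` if ranges that touch on a shared day should count as overlapping, as in the question's own code.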

1
Thanks. Simple and clean. Do you know how to do the same thing (groupby + check), but not only for consecutive dates? - Edouard
1
I'm not entirely sure what you mean... if the dates are already sorted, what more is there to do? Also, I've added an example of grouping by id for you. - miradulo
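One possible reading of the question about non-consecutive dates: even when rows are sorted by "from", a long early range can overlap a later row that its immediate predecessor does not touch. A sketch of one way to catch that, assuming this is what was meant, is to compare each "from" with the running maximum of all earlier "to" values in the same id (the sample data here is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1],
    'from': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'to': pd.to_datetime(['2020-12-31', '2020-02-10', '2020-03-10']),
})
df = df.sort_values(['id', 'from'])

# Check against the immediately preceding row only
consecutive = df.groupby('id')['to'].shift() > df['from']

# Latest end date seen so far within each id, shifted so a row is not compared with itself
running_max = df.groupby('id')['to'].cummax()
latest_prev_to = running_max.groupby(df['id']).shift()

# Check against every earlier row of the same id
any_earlier = latest_prev_to > df['from']
```

Here the third row is missed by the consecutive check but caught by the running maximum, because it lies inside the first row's range.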

2
You can compare each "from" with the previous "to":
df['to'].shift() > df['from']

Output:

0    False
1    False
2    False
3    False
4     True
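Note that a plain shift ignores the id column, so the first row of one id is compared with the last row of the previous id. The sample data happens not to trigger this, but a sketch with invented data shows the effect and one way to mask it out without a groupby:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'from': pd.to_datetime(['2020-01-01', '2020-03-01']),
    'to': pd.to_datetime(['2020-06-30', '2020-09-30']),
})

# Ungrouped comparison: row 1 is checked against row 0 even though the ids differ
naive = df['to'].shift() > df['from']

# Only keep a hit when the previous row belongs to the same id
same_id = df['id'] == df['id'].shift()
masked = naive & same_id
```

This relies on the rows being sorted by id and then by "from", so that rows of the same id are contiguous.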

2

Another solution. This could be rewritten to take advantage of Interval.overlaps in pandas 0.24 and later.

Original answer:

def overlapping_groups(group):
    if len(group) > 1:
      for index, row in group.iterrows():
        for index2, row2 in group.drop(index).iterrows():
          int1 = pd.Interval(row2['start_date'],row2['end_date'], closed = 'both')
          if row['start_date'] in int1:
            return row['id']
          if row['end_date'] in int1:
            return row['id']

gcols = ['id']
group_output = df.groupby(gcols,group_keys=False).apply(overlapping_groups)
ids_with_overlap = set(group_output[~group_output.isnull()].reset_index(drop = True))
df[df['id'].isin(ids_with_overlap)]
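A rough sketch of what the overlaps-based rewrite mentioned above might look like. It is not the answerer's code: it uses the question's "from"/"to" column names rather than start_date/end_date, the helper name ids_with_overlap is made up, and it relies on IntervalIndex.overlaps, which tests every interval in the index against a single Interval:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})


def ids_with_overlap(frame):
    """Return the set of ids that contain at least one pair of overlapping ranges."""
    hits = set()
    for key, group in frame.groupby('id'):
        intervals = pd.IntervalIndex.from_arrays(group['from'], group['to'],
                                                 closed='both')
        # Every interval overlaps itself, so more than one match means a real overlap
        if any(intervals.overlaps(one).sum() > 1 for one in intervals):
            hits.add(key)
    return hits


overlapping_ids = ids_with_overlap(df)
result = df[df['id'].isin(overlapping_ids)]
```

Each group is still checked pairwise, so this is quadratic in the group size, but the inner comparison is vectorized instead of looping over rows twice.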

1

I ran into a similar problem and searched extensively. I found this solution. It uses the overlaps function from pandas, which is documented in detail here.

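import numpy as np
import pandas as pd

def function(df):
    # "from" is a Python keyword, so the columns are accessed with brackets
    timeintervals = pd.IntervalIndex.from_arrays(df['from'], df['to'], closed='both')
    index = np.arange(timeintervals.size)
    index_to_keep = []
    for _ in timeintervals:
        # Keep the first remaining interval
        index_to_keep.append(index[0])
        # Flag every remaining interval that overlaps it (including itself)
        control = timeintervals[index].overlaps(timeintervals[index[0]])
        if control.any():
            # Drop the flagged intervals and continue with the rest
            index = index[~control]
        else:
            break
        if index.size == 0:
            break
    # Build the result once the loop has finished
    temp = df.index[index_to_keep]
    output = df.loc[temp]
    return output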

1
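You can sort by the "from" column and then use a rolling apply to check whether each "from" overlaps the previous "to". This approach is quite efficient.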
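import numpy as np

# Convert the datetimes to integer nanoseconds so they can be compared numerically
df['from'] = pd.DatetimeIndex(df['from']).astype(np.int64)
df['to'] = pd.DatetimeIndex(df['to']).astype(np.int64)

sdf = df.sort_values(by='from')
# raw=True passes each window as a NumPy array, so r[0] and r[1] are positional
sdf[["from", "to"]].stack().rolling(window=2).apply(lambda r: 1 if r[1] >= r[0] else 0, raw=True).unstack()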

The overlapping periods are now the ones where from=0.0.
   from   to
0   NaN  1.0
1   1.0  1.0
2   1.0  1.0
3   1.0  1.0
4   0.0  1.0
