Merging overlapping datetime intervals

Question

I have a dataframe containing several datetime intervals (start time, end time), each with a value.

Input:

id   start          end            value
1    08:00:00.000   12:00:00.000   5
2    09:00:00.000   10:00:00.000   6
2    10:00:00.000   14:00:00.000   4
1    12:00:00.000   15:00:00.000   3

Expected output:

id   start          end            value 
1    08:00:00.000   09:00:00.000   5
2    09:00:00.000   10:00:00.000   6
1    10:00:00.000   12:00:00.000   5
2    12:00:00.000   14:00:00.000   4
1    14:00:00.000   15:00:00.000   3

Some of these intervals overlap. The goal is to end up with a series of non-overlapping intervals.

Where intervals overlap, I want to keep the interval with the highest value.

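I wrote a loop that searches the dataframe for overlapping intervals, creates new intervals according to the condition, and deletes the old ones. I would like to find a better, more optimized approach. Perhaps all intervals could be split at their intersection points, followed by a loop over the dataframe that drops overlapping intervals based on the condition.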

import numpy as np
import pandas as pd

# `done` is True while the last pass changed df, so another pass is needed.
done = True

while done:
    done = False
    df_copy = df
    for i, row in df.iterrows():
        row_interval = pd.Interval(row.start, row.end)
        if done:
            break
        for j, row_copy in df_copy.iterrows():
            row_copy_interval = pd.Interval(row_copy.start, row_copy.end)
            # Compare index labels by value, not by identity.
            if i != j and row_interval.overlaps(row_copy_interval):
                earliest_start = np.minimum(row.start, row_copy.start)
                latest_start = np.maximum(row.start, row_copy.start)
                earliest_end = np.minimum(row.end, row_copy.end)
                latest_end = np.maximum(row.end, row_copy.end)

                # The overlapping part keeps the higher of the two values.
                if row.value > row_copy.value:
                    value = row.value
                else:
                    value = row_copy.value

                # Note: DataFrame.append was removed in pandas 2.0;
                # on newer versions use pd.concat instead.
                if row_interval == pd.Interval(earliest_start, earliest_end):
                    df = df.append({'value': row.value, 'start': earliest_start, 'end': latest_start}, ignore_index=True)
                    df = df.append({'value': value, 'start': latest_start, 'end': earliest_end}, ignore_index=True)
                    df = df.append({'value': row_copy.value, 'start': earliest_end, 'end': latest_end}, ignore_index=True)
                elif row_interval == pd.Interval(earliest_start, latest_end):
                    ...
                elif row_interval == pd.Interval(latest_start, latest_end):
                    ...
                elif row_interval == pd.Interval(latest_start, earliest_end):
                    ...

                # Drop the two original rows and restart the scan.
                df = df.drop([i, j]).drop_duplicates()
                done = True
                break

- NCall

You might want to take a look at resampling. - kelyen

I don't see how to apply it to my solution, but it looks interesting. - NCall

1 Answer

- Guybrush · Accepted Answer

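I'm the maintainer of portion, a Python library that handles (unions of) intervals of arbitrary (comparable) objects (see https://github.com/AlexandreDecan/portion for details; it is also available on PyPI). Among other features, portion provides an IntervalDict class that basically behaves like a classic dictionary whose keys are (unions of) intervals. This class could be helpful for your use case, since it lets you put all your date(time) intervals into a single structure and apply some logic on top of it.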

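An IntervalDict object defines a .combine function that takes another IntervalDict and a function describing how the two IntervalDict instances should be merged. With that function, you can specify that the max of the values is kept wherever intervals overlap. In other words: create an IntervalDict instance for each row of your dataframe, then combine them one after another with .combine, passing max as the function. You end up with a list of (key, value) pairs in which each key is a (non-overlapping) time interval and each value is the maximum value over that interval.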