Filling missing values in a time-series pandas DataFrame

Question


python pandas datetime time-series pandas-resample

4

I have a pandas DataFrame in which the time series has gaps.
It looks like this:

Sample input

--------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124

Desired output

--------------------------------------------
     Timestamp        Close
 2021-02-07 09:30:00  124.624 
 2021-02-07 09:31:00  124.617
 2021-02-07 09:32:00  124.617
 2021-02-07 09:33:00  124.617
   'Insert a line for each minute up to the next available
   timestamp with the Close value from the last available timestamp'
 2021-02-07 10:03:00  124.617 
 2021-02-07 10:04:00  123.946
 2021-02-07 16:00:00  123.300
   'I do not want lines inserted here, as this date is not
   present in the original dataset (it could be a non-trading
   day, so I do not want to fill this gap)'
 2021-02-09 09:04:00  125.746
 2021-02-09 09:05:00  125.646
 2021-02-09 15:58:00  125.235
   'Fill the gaps here again but only between 09:30 and 16:00 time'
 2021-02-09 15:59:00  126.987
 2021-02-09 16:00:00  127.124

What I have tried:

# set the index column
df_process.set_index('Exchange DateTime', inplace=True)

# resample and forward fill the gaps
df_process_out = df_process.resample(rule='1T').ffill()

# filter and return only timestamps between 09:30 and 16:00
df_process_out = df_process_out.between_time(start_time='09:30:00', end_time='16:00:00')

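However, with this approach the resampling also generates new timestamps for dates that do not exist in the original DataFrame. In the example above, it would also generate minute-by-minute timestamps for 2021-02-08.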
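A minimal sketch that reproduces this behaviour on the sample input above. The frame is rebuilt by hand, using the `Timestamp` column name from the sample rather than `Exchange DateTime`, and the `'1min'` alias stands in for `'1T'` (an assumption, made so the snippet also runs on recent pandas versions where `'T'` is deprecated).

```python
import pandas as pd

# Sample input from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746, 125.646,
              125.235, 126.987, 127.124],
})

# Resample the whole range at once, then keep only 09:30-16:00.
resampled = df.set_index("Timestamp").resample("1min").ffill()
in_session = resampled.between_time("09:30:00", "16:00:00")

# Dates present after the global resample: 2021-02-08 shows up
# even though it never appears in the input.
dates_out = sorted(set(in_session.index.date))
print(dates_out)
```

The rows for 2021-02-08 all carry the last Close of 2021-02-07, which is exactly the unwanted fill described above.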

Is there a way to avoid this?

Also, is there a better way to avoid resampling over the entire time range?

df_process_out = df_process.resample(rule='1T').ffill()

This line generates timestamps from 00:00 to 24:00, and in the next line of code I have to filter most of them out again, which does not seem efficient.

Any help or guidance would be highly appreciated.
Thanks.

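Edit:
As requested, here is a small sample set.

df_in: input data
df_out_error: wrong output data
df_out_OK: what the output data should look like

I prepared a small sample in the following Colab notebook.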

https://colab.research.google.com/drive/1Fps2obTv1YPDpTzXTo7ivLI5njoI-y4n?usp=sharing

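Please note that this is only a small subset of the data. I am trying to clean up several years of data that are structured like this and have missing minute timestamps.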

- Chris Bauer

1

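Please create a small reproducible DataFrame and provide the complete expected output DataFrame. - sammywemmy

Why do you not want rows inserted between 2021-02-07 10:04:00 and 2021-02-07 16:00:00? Or should every minute be filled? - Akshay Sehgal

Sorry for being unclear. Yes, this gap should also be filled with 1-minute (or other interval) timestamps. - Chris Bauer

Please test the code in my answer below; it should solve your problem. - Akshay Sehgal

It should address both of your concerns: resampling over a limited time window, and applying the resampling only to existing dates. - Akshay Sehgal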

1 Answer

Content provided by Stack Overflow.

- Akshay Sehgal · Accepted Answer

You can do this by combining df.groupby() (by date) with resampling using rule = "1Min". Try this:

df_new = (df.assign(date=df.Timestamp.dt.date)   #create new col 'date' from the timestamp
            .set_index('Timestamp')              #set timestamp as index
            .groupby('date')                     #groupby for each date
            .apply(lambda x: x.resample('1Min')  #apply resampling for 1 minute from start time to end time for that date
                   .ffill())                     #ffill values
            .reset_index('date', drop=True)      #drop index 'date' that was created by groupby
            .drop(columns='date', errors='ignore') #drop 'date' column if still present (drop('date',1) fails on pandas >= 2.0)
            .reset_index()                       #reset index to get back original 2 cols
         )

df_new
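For anyone who wants to run the idea end to end, here is a self-contained sketch on the sample data from the question. It is a variant, not the exact code above: it groups on the index dates and concatenates the per-day results with `pd.concat` instead of `groupby().apply()`, and it uses the `'1min'` alias. The variable names (`indexed`, `parts`, `filled`) are illustrative.

```python
import pandas as pd

# Sample input from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746, 125.646,
              125.235, 126.987, 127.124],
})

indexed = df.set_index("Timestamp")

# Resample each calendar date on its own, so the 1-minute grid only
# spans that date's first to last observed timestamp.
parts = [
    day.resample("1min").ffill()
    for _, day in indexed.groupby(indexed.index.date)
]

# Stitch the per-day frames back together and restore the two columns.
filled = pd.concat(parts).reset_index()

print(len(filled))
print(filled["Timestamp"].dt.date.unique())
```

On this sample, 2021-02-07 is filled from 09:30 to 16:00 and 2021-02-09 from 09:04 to 16:00, with nothing generated for 2021-02-08.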

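Explanation

1. Resampling only over a limited time window

"Also, is there a better way to avoid resampling over the entire time range? This generates timestamps from 00:00 to 24:00, and then I have to filter most of them out again. It does not seem efficient."

As in the solution above, you can resample with rule='1Min' and then apply ffill (or any other fill method). Within each date group, this resamples only from the first to the last timestamp present in the data for that date, rather than from 00:00 to 24:00. To demonstrate, here is the result of applying it to a single date from the data: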

#filtering for a single day (df has no 'date' column yet, so derive the dates from Timestamp)
dates = df.Timestamp.dt.date
ddd = df[dates == dates.unique()[0]]

#applying resampling on that given day
ddd.set_index('Timestamp').resample('1Min').ffill()

Note the start time (09:30:00) and end time (16:00:00) for the given date.
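The bounds can also be checked programmatically rather than by eye. A small sketch on the sample data from the question, with illustrative variable names:

```python
import pandas as pd

# Sample input from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746, 125.646,
              125.235, 126.987, 127.124],
})

# Keep only the rows of the earliest date in the data.
day = df["Timestamp"].dt.normalize()
first_day = df[day == day.min()]

# Resample that single day to 1-minute steps and forward fill.
one_day = first_day.set_index("Timestamp").resample("1min").ffill()

# The grid starts at the first and ends at the last observed timestamp.
print(one_day.index.min())  # 2021-02-07 09:30:00
print(one_day.index.max())  # 2021-02-07 16:00:00
print(len(one_day))         # 391 one-minute rows
```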

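2. Resampling only for existing dates

"In the example above, it would also generate minute-by-minute timestamps for 2021-02-08. How can I avoid this?"

As in the solution above, you can apply the resampling separately to each date group. Here, I use a lambda function to apply it after extracting the date from the timestamp. Resampling therefore happens only on dates that exist in the dataset.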

df_new.Timestamp.dt.date.unique()

array([datetime.date(2021, 2, 7), datetime.date(2021, 2, 9)], dtype=object)

Note that the output contains only the 2 unique dates from the original dataset.
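One caveat: per-day resampling starts at each day's first observation, so on 2021-02-09 the grid would begin at 09:04 rather than 09:30. If only the 09:30 to 16:00 window from the question is wanted, one possible sketch is to reindex each day onto a fixed session grid. The session bounds are taken from the question; the helper name `fill_session` is hypothetical. If a day's first observation falls after 09:30, the leading rows would be NaN, since there is nothing to forward fill from.

```python
import pandas as pd

# Sample input from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-02-07 09:30:00", "2021-02-07 09:31:00", "2021-02-07 10:04:00",
        "2021-02-07 16:00:00", "2021-02-09 09:04:00", "2021-02-09 09:05:00",
        "2021-02-09 15:58:00", "2021-02-09 15:59:00", "2021-02-09 16:00:00",
    ]),
    "Close": [124.624, 124.617, 123.946, 123.300, 125.746, 125.646,
              125.235, 126.987, 127.124],
})


def fill_session(day: pd.DataFrame) -> pd.DataFrame:
    """Forward fill one day's rows onto a 09:30-16:00 one-minute grid."""
    midnight = day.index[0].normalize()
    grid = pd.date_range(
        midnight + pd.Timedelta("09:30:00"),
        midnight + pd.Timedelta("16:00:00"),
        freq="1min",
    )
    # Observations before 09:30 are not kept as rows, but they still
    # provide the value that is carried forward into the grid.
    return day.reindex(grid, method="ffill")


indexed = df.set_index("Timestamp")
parts = [fill_session(day) for _, day in indexed.groupby(indexed.index.date)]
session = pd.concat(parts).rename_axis("Timestamp").reset_index()

print(len(session))
```

Because only dates present in the input are visited, no rows are created for 2021-02-08, and no full-day grid has to be built and filtered afterwards.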