I read some data from a remote device every 5 seconds.
It is saved as:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 2
2018-01-01 00:00:20 3
2018-01-01 00:00:25 4
2018-01-01 00:00:30 3
2018-01-01 00:00:35 2
2018-01-01 00:00:40 4
2018-01-01 00:00:45 5
2018-01-01 00:00:50 3
2018-01-01 00:00:55 3
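Unfortunately, the communication is not always reliable, and sometimes a read fails.
When that happens, the remote device delivers the cumulative value of the missed readings as soon as it can.
The data above could then end up saved as: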
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
.......... 00:00:15 missing...
.......... 00:00:20 missing...
.......... 00:00:25 missing...
2018-01-01 00:00:30 12 <--- sum of the last 4 readings
2018-01-01 00:00:35 2
.......... 00:00:40 missing...
.......... 00:00:45 missing...
2018-01-01 00:00:50 15 <--- sum of the last 3 readings
2018-01-01 00:00:55 3
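I need to fill in all the missing rows and remove the peaks in the original data, replacing each peak with the mean calculated over it (the peak divided by the number of rows it covers).
Resampling is easy: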
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 NaN
2018-01-01 00:00:20 NaN
2018-01-01 00:00:25 NaN
2018-01-01 00:00:30 12
2018-01-01 00:00:35 2
2018-01-01 00:00:40 NaN
2018-01-01 00:00:45 NaN
2018-01-01 00:00:50 15
2018-01-01 00:00:55 3
But how do I fill the NaNs and remove the peaks?
I looked at the various methods of asfreq and resample, but none of them (bfill, ffill) is of any use in this case.
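To illustrate why the plain fill methods fall short, here is a minimal sketch using the same values as the test data at the end of this post. The names `raw`, `backfilled` and `forward_filled` are only illustrative.

```python
import numpy as np
import pandas as pd

time = pd.date_range(start="2021-01-01", freq="5s", periods=12)
raw = pd.Series(
    [2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time
)

# bfill copies each peak backwards into its gap; the peak itself is unchanged
backfilled = raw.bfill()

# ffill repeats the last good reading; the peak is again unchanged
forward_filled = raw.ffill()

print(backfilled.tolist())
print(forward_filled.tolist())
```

In both cases the total over the series grows, because the cumulative reading is kept in full and the gap rows are filled on top of it, rather than the peak being spread across the gap.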
The final result should be:
2018-01-01 00:00:00 2
2018-01-01 00:00:05 3
2018-01-01 00:00:10 3
2018-01-01 00:00:15 3 <--- NaN filled with mean = peak 12/4 rows
2018-01-01 00:00:20 3 <--- NaN filled with mean
2018-01-01 00:00:25 3 <--- NaN filled with mean
2018-01-01 00:00:30 3 <--- peak changed
2018-01-01 00:00:35 2
2018-01-01 00:00:40 5 <--- NaN filled with mean = peak 15/3 rows
2018-01-01 00:00:45 5 <--- NaN filled with mean
2018-01-01 00:00:50 5 <--- peak changed
2018-01-01 00:00:55 3
The Series I am using for testing:
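import numpy as np
import pandas as pd

# 12 timestamps, 5 seconds apart
time = pd.date_range(start='2021-01-01', freq='5s', periods=12)
# NaN marks a failed read; dropna() keeps only the rows that were actually stored
read_data = pd.Series([2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time).dropna()
# Re-insert the missing timestamps as NaN rows
read_data.asfreq("5s")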
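One possible way to get the desired result is sketched below. It is not a definitive solution and rests on an assumption: every reading that directly follows a run of NaNs is treated as the cumulative value for that run plus its own row. The names `resampled`, `block_id`, `block_size` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

time = pd.date_range(start="2021-01-01", freq="5s", periods=12)
read_data = pd.Series(
    [2, 3, 3, np.nan, np.nan, np.nan, 12, 2, np.nan, np.nan, 15, 3], index=time
).dropna()

# Re-insert the missing timestamps as NaN rows
resampled = read_data.asfreq("5s")

# Counting valid readings from the end gives each NaN run the same id
# as the first valid reading that follows it.
block_id = resampled.notna()[::-1].cumsum()[::-1]

# Number of rows in each block (NaN rows plus the closing reading)
block_size = resampled.groupby(block_id).transform("size")

# Assumption: the closing reading is the sum over its block,
# so spread it evenly across all rows of the block.
filled = resampled.bfill() / block_size

print(filled)
```

Rows that were read normally form blocks of size 1 and are left unchanged, and the total of the series is preserved. A reading that is high for other reasons but follows a gap would also be spread out, so the assumption above needs to hold for the real data.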