Efficient sliding window function for time series

3

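I am trying to build a sliding window for a time series. So far I have written a function that takes a given series, lets you set a window size in seconds, and then creates rolling samples. My problem is that it runs very slowly and does not seem to be an efficient approach.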

# ========== create dataset  =========================== #

import numpy as np  # needed for np.unique further down
import pandas as pd
from datetime import timedelta, datetime


timestamp_list = ["2022-02-07 11:38:08.625",
                  "2022-02-07 11:38:09.676", 
                  "2022-02-07 11:38:10.084", 
                  "2022-02-07 11:38:10.10000",  
                  "2022-02-07 11:38:11.2320"]

bid_price_list = [1.14338, 
                  1.14341, 
                  1.14340, 
                  1.1434334, 
                  1.1534334]

# build the frame directly from named columns
df = pd.DataFrame({"timestamp": timestamp_list, "value": bid_price_list})

# convert the timestamp strings to datetime objects
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S.%f")

df.head(3)
# output (shown after the timestamp_to_sec column below has been added):
#                 timestamp    value    timestamp_to_sec
# 0 2022-02-07 11:38:08.625  1.14338 2022-02-07 11:38:08
# 1 2022-02-07 11:38:09.676  1.14341 2022-02-07 11:38:09
# 2 2022-02-07 11:38:10.084  1.14340 2022-02-07 11:38:10

# ========== create rolling time-series function  ====== #


# get the floor of time (second value)
df["timestamp_to_sec"]  = df["timestamp"].dt.floor('s')

# set rolling window length in seconds
window_dt = pd.Timedelta(seconds=2)

# containers for rolling sample statistics
n_list = []
mean_list = []
std_list =[]

# add dt (window) seconds to the original time which was floored to the second
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"]  + window_dt

# get unique end times
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)

# remove end times that are greater than the last actual time, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]

# loop running the sliding window (time_i is the end time of each window)
for time_i in time_unique_endlist:
    
    # start time of each rolling window
    start_time = time_i - window_dt
    
    # sample for each time period of sliding window
    rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]

    
    # calculate the sample statistics
    n_list.append(len(rolling_sample)) # store n observation count
    mean_list.append(rolling_sample["value"].mean()) # store rolling sample mean (value column only)
    std_list.append(rolling_sample["value"].std()) # store rolling sample standard deviation (value column only)
    
    # plot histogram for each sample of the rolling sample
    #plt.hist(rolling_sample.value, bins=10)

# tested and n_list brought back the correct values
>>> n_list
[2,3]

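Is there a more efficient way to do this? Is there a way to improve my approach, or an open-source package that lets me run a rolling window like this? I know pandas has .rolling(), but that rolls over the values. I need something that works on unevenly spaced data and uses time to define the fixed rolling window.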
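
A side note on the `.rolling()` remark: pandas also accepts a time offset such as `"2s"` as the window when the frame has a datetime column or index, so it can handle unevenly spaced data. Below is a minimal sketch, assuming the same five sample rows as above (fractional seconds written with three digits). Note the difference from the loop in the question: pandas evaluates one window per row, ending at that row's timestamp and open on the left by default, rather than at fixed whole-second window ends.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-02-07 11:38:08.625",
        "2022-02-07 11:38:09.676",
        "2022-02-07 11:38:10.084",
        "2022-02-07 11:38:10.100",
        "2022-02-07 11:38:11.232",
    ]),
    "value": [1.14338, 1.14341, 1.14340, 1.1434334, 1.1534334],
})

# Time-based window: for each row, cover (timestamp - 2s, timestamp].
# The spacing between rows does not matter; timestamps must be sorted.
roll = df.rolling("2s", on="timestamp")["value"]

stats = pd.DataFrame({
    "timestamp": df["timestamp"],
    "sample_size": roll.count(),
    "sample_mean": roll.mean(),
    "sample_std": roll.std(),
})
print(stats)
```

If the statistics are only needed at fixed window ends, this per-row output would still have to be reduced to those end times, so it is not a drop-in replacement for the loop.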

1 Answer

1

This appears to be the best-performing version I could find; I hope it helps others.

# set rolling window length in seconds
window_dt = pd.Timedelta(seconds=2)

# add dt seconds to the timestamp that was floored to the second
df["timestamp_to_sec_dt"] = df["timestamp_to_sec"]  + window_dt

# unique end time
time_unique_endlist = np.unique(df.timestamp_to_sec_dt)

# remove end values that are greater than the last actual value, i.e. max(df["timestamp_to_sec"])
time_unique_endlist = time_unique_endlist[time_unique_endlist <= max(df["timestamp_to_sec"])]

# containers for rolling sample statistics
mydic = {}
counter = 0

# loop running the rolling window
for time_i in time_unique_endlist:
    
    start_time = time_i - window_dt
    
    # sample for each time period of sliding window
    rolling_sample = df[(df.timestamp >= start_time) & (df.timestamp <= time_i)]

    # calculate the sample statistics
    mydic[counter] = {
                        "sample_size":len(rolling_sample),
                        "sample_mean":rolling_sample["value"].mean(),
                        "sample_std":rolling_sample["value"].std()
                        }
    counter = counter + 1

# collect the results in a DataFrame, one row per window
results = pd.DataFrame.from_dict(mydic, orient="index")
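
The loop above still applies a boolean mask to the whole DataFrame for every window. One possible alternative, sketched here under the assumption that the timestamps can be sorted, is to locate each window's row range with a binary search (`np.searchsorted`) and slice a NumPy array instead. The window ends are built the same way as in the loop; the column name `window_end` is an illustrative choice, not part of the original code.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-02-07 11:38:08.625",
        "2022-02-07 11:38:09.676",
        "2022-02-07 11:38:10.084",
        "2022-02-07 11:38:10.100",
        "2022-02-07 11:38:11.232",
    ]),
    "value": [1.14338, 1.14341, 1.14340, 1.1434334, 1.1534334],
})
window_dt = pd.Timedelta(seconds=2)

# Binary search requires sorted timestamps
df = df.sort_values("timestamp")
ts = df["timestamp"].to_numpy()
values = df["value"].to_numpy()

# Same window ends as the loop: unique (floored second + window),
# capped at the last floored second
floored = df["timestamp"].dt.floor("s")
ends = np.unique((floored + window_dt).to_numpy())
ends = ends[ends <= floored.max().to_datetime64()]
starts = ends - window_dt.to_timedelta64()

# Row range [lo, hi) of each closed window [start, end]
lo = np.searchsorted(ts, starts, side="left")
hi = np.searchsorted(ts, ends, side="right")

results = pd.DataFrame({
    "window_end": ends,
    "sample_size": hi - lo,
    "sample_mean": [values[a:b].mean() for a, b in zip(lo, hi)],
    # ddof=1 matches the default of pandas .std()
    "sample_std": [values[a:b].std(ddof=1) for a, b in zip(lo, hi)],
})
print(results)
```

On the sample data this yields the same window sizes as the loop (2 and 3). Windows with fewer than two rows would produce NaN for the standard deviation along with a NumPy runtime warning, so real data may need a guard for that case.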

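This works well, but if you are reading from a very large file, or automatically from a continuous log, how do you avoid building an infinitely large DataFrame? Also, a pandas DataFrame is not well suited to appending new entries, because the whole frame has to be copied in the process... Is there a way around this, so that pandas can be used on an "infinite" time series with rolling-window functionality? - Jinx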
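
One way to keep memory bounded for an unending stream, offered here only as a sketch, is to skip the DataFrame for ingestion and hold just the observations inside the current window in a `collections.deque`. The `StreamingWindow` class and its `add` method are hypothetical names introduced for illustration; the sketch assumes observations arrive in timestamp order and computes statistics for the closed window ending at each new observation.

```python
from collections import deque
from datetime import datetime, timedelta
from statistics import mean, stdev


class StreamingWindow:
    """Hypothetical helper: keeps only observations from the last `window` of time."""

    def __init__(self, window: timedelta):
        self.window = window
        self.buffer = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: datetime, value: float) -> dict:
        """Add one observation (in timestamp order) and return window statistics."""
        self.buffer.append((timestamp, value))
        # Evict observations that fall before the window start
        cutoff = timestamp - self.window
        while self.buffer[0][0] < cutoff:
            self.buffer.popleft()
        vals = [v for _, v in self.buffer]
        return {
            "sample_size": len(vals),
            "sample_mean": mean(vals),
            # sample standard deviation needs at least two points
            "sample_std": stdev(vals) if len(vals) > 1 else float("nan"),
        }


# Feed the sample rows from the question one at a time
rows = [
    ("2022-02-07 11:38:08.625", 1.14338),
    ("2022-02-07 11:38:09.676", 1.14341),
    ("2022-02-07 11:38:10.084", 1.14340),
    ("2022-02-07 11:38:10.100", 1.1434334),
    ("2022-02-07 11:38:11.232", 1.1534334),
]
win = StreamingWindow(timedelta(seconds=2))
for ts_str, value in rows:
    stats = win.add(datetime.fromisoformat(ts_str), value)
    print(ts_str, stats)
```

The buffer never holds more than one window's worth of rows, so memory use depends on the window length and data rate rather than on the total length of the stream. Results can be written out periodically, or collected into a DataFrame in batches, instead of appending row by row.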

Content provided by Stack Overflow.