Time series analysis - unevenly spaced measurements - pandas + statsmodels

I have two numpy arrays, light_points and time_points, and would like to apply some time series analysis methods to the data. I tried this:
import statsmodels.api as sm
import pandas as pd
tdf = pd.DataFrame({'time':time_points[:]})
rdf =  pd.DataFrame({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(freq='w',start=0,periods=len(rdf.light))
#rdf.index = pd.DatetimeIndex(tdf['time'])

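This runs, but it does not do what I want. The measurements are in fact unevenly spaced in time, and if I simply declare the time_points column as the index of the pandas DataFrame, I get an error: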

rdf.index = pd.DatetimeIndex(tdf['time'])

decomp = sm.tsa.seasonal_decompose(rdf)

elif freq is None:
raise ValueError("You must specify a freq or x must be a pandas object with a timeseries index")

ValueError: You must specify a freq or x must be a pandas object with a timeseries index

I don't know how to correct this. Also, it seems that pandas' TimeSeries has been deprecated.

I tried this:

rdf = pd.Series({'light':light_points[:]})
rdf.index = pd.DatetimeIndex(tdf['time'])

But it gives me a length mismatch:

ValueError: Length mismatch: Expected axis has 1 elements, new values have 122 elements

However, I don't understand where it comes from, since rdf['light'] and tdf['time'] have the same length...
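For reference, the length mismatch can be reproduced on its own. When pd.Series is given a dict, each key becomes one entry, so {'light': array} yields a Series of length 1 whose single value is the whole array. A minimal sketch, with a made-up array standing in for light_points:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real light_points array (122 values).
light_points = np.arange(122, dtype=float)

# A dict with one key gives a Series with ONE element: the whole array.
s_from_dict = pd.Series({'light': light_points})
print(len(s_from_dict))   # 1

# Passing the array directly gives one element per measurement.
s_from_array = pd.Series(light_points)
print(len(s_from_array))  # 122
```

Assigning a 122-element index to the first Series is what triggers "Expected axis has 1 elements, new values have 122 elements".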

Eventually, I tried defining my rdf as a pandas Series:

rdf = pd.Series(light_points[:],index=pd.DatetimeIndex(time_points[:]))

And I get this:

ValueError: You must specify a freq or x must be a pandas object with a timeseries index
Then I tried replacing the index with

pd.TimeSeries(time_points[:])

and it gives me an error on the seasonal_decompose line:

AttributeError: 'Float64Index' object has no attribute 'inferred_freq'

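How do I work with unevenly spaced data? I was thinking about creating an approximately evenly spaced time array by adding many unknown values between the existing ones and using interpolation to "evaluate" those points, but I think there may be a cleaner and easier solution.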


You will increase your chances of getting a good answer if you post a minimal, complete, and verifiable example. - Mike Müller
1 Answer


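seasonal_decompose() requires a freq that is either provided as part of the DatetimeIndex meta information, can be inferred via pandas.Index.inferred_freq, or is specified by the user as an integer giving the number of periods per cycle, e.g. 12 for monthly data (from the docstring of seasonal_mean):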

def seasonal_decompose(x, model="additive", filt=None, freq=None):
    """
    Parameters
    ----------
    x : array-like
        Time series
    model : str {"additive", "multiplicative"}
        Type of seasonal component. Abbreviations are accepted.
    filt : array-like
        The filter coefficients for filtering out the seasonal component.
        The default is a symmetric moving average.
    freq : int, optional
        Frequency of the series. Must be used if x is not a pandas
        object with a timeseries index.
To illustrate, using random sample data:
import numpy as np
from datetime import datetime

length = 400
x = np.sin(np.arange(length)) * 10 + np.random.randn(length)
df = pd.DataFrame(data=x, index=pd.date_range(start=datetime(2015, 1, 1), periods=length, freq='W'), columns=['value'])

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 400 entries, 2015-01-04 to 2022-08-28
Freq: W-SUN

decomp = sm.tsa.seasonal_decompose(df)
data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

Data columns (total 4 columns):
series      400 non-null float64
trend       348 non-null float64
seasonal    400 non-null float64
resid       348 non-null float64
dtypes: float64(4)
memory usage: 15.6 KB

So far, so good. Now randomly drop elements from the DatetimeIndex to create unevenly spaced data:

df = df.iloc[np.unique(np.random.randint(low=0, high=length, size=int(length * .8)))]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 222 entries, 2015-01-11 to 2022-08-21
Data columns (total 1 columns):
value    222 non-null float64
dtypes: float64(1)
memory usage: 3.5 KB

df.index.freq

None

df.index.inferred_freq

None

Running seasonal_decompose on this data "works":

decomp = sm.tsa.seasonal_decompose(df, freq=52)

data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
data.columns = ['series', 'trend', 'seasonal', 'resid']

DatetimeIndex: 224 entries, 2015-01-04 to 2022-08-07
Data columns (total 4 columns):
series      224 non-null float64
trend       172 non-null float64
seasonal    224 non-null float64
resid       172 non-null float64
dtypes: float64(4)
memory usage: 8.8 KB

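The question is how useful the result is. Even setting aside the gaps in the data, which complicate inference of seasonal patterns (see the example use of .interpolate() in the release notes), statsmodels qualifies this procedure as follows: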
Notes
-----
This is a naive decomposition. More sophisticated methods should
be preferred.

The additive model is Y[t] = T[t] + S[t] + e[t]

The multiplicative model is Y[t] = T[t] * S[t] * e[t]

The seasonal component is first removed by applying a convolution
filter to the data. The average of this smoothed series for each
period is the returned seasonal component.

Why did you use freq=52 and not some other number? - Rocketq
It has been a while, but I believe it is because my example uses weekly random data - see above. - Stefan
