使用FFT分析Google趋势时间序列的季节性

Question

使用FFT分析Google趋势时间序列的季节性

7

我正在尝试使用快速傅里叶变换评估Google趋势时间序列的幅度谱。如果您查看提供的此处的“饮食”数据，它显示出非常强的季节性模式。

我认为可以使用FFT分析这个模式，预计应该有一个1年周期的强峰。

然而，当我像这样应用FFT时（a_gtrend_ham是与Hamming窗口相乘的时间序列）：

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import fft, fftshift
import pandas as pd

gtrend = pd.read_csv('multiTimeline.csv',index_col=0)

gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')

# Sampling rate
fs = 12 #Points per year

a_gtrend_orig = gtrend['diet: (Worldwide)']
N_gtrend_orig = len(a_gtrend_orig)
length_gtrend_orig = N_gtrend_orig / fs
t_gtrend_orig = np.linspace(0, length_gtrend_orig, num = N_gtrend_orig, endpoint = False)

a_gtrend_sel = a_gtrend_orig.loc['2005-01-01 00:00:00':'2017-12-01 00:00:00']
N_gtrend = len(a_gtrend_sel)
length_gtrend = N_gtrend / fs
t_gtrend = np.linspace(0, length_gtrend, num = N_gtrend, endpoint = False)

a_gtrend_zero_mean = a_gtrend_sel - np.mean(a_gtrend_sel)
ham = np.hamming(len(a_gtrend_zero_mean))
a_gtrend_ham = a_gtrend_zero_mean * ham

N_gtrend = len(a_gtrend_ham)
ampl_gtrend = 1/N_gtrend * abs(fft(a_gtrend_ham))
mag_gtrend = fftshift(ampl_gtrend)
freq_gtrend = np.linspace(-0.5, 0.5, len(ampl_gtrend))
response_gtrend = 20 * np.log10(mag_gtrend)
response_gtrend = np.clip(response_gtrend, -100, 100)

我的幅度谱结果没有显示出任何主导峰。

我在如何使用FFT获取数据序列频谱方面存在误解吗？

- Axel

您能否请包含读取数据和应用 Hamming 窗口的代码行？我非常确定我可以为您回答这个问题。完整（但最小化）的代码示例将为我节省一些时间。 - DrM

好的，明白了。您的gtrend.csv文件是否只是从网站上复制的multiTimeline.csv文件？此外，您可能还需要一些pandas和numpy的导入语句。假设数据文件可以直接使用，我会尽快回复您。 - DrM

是的，gtrend.csv只是multiTimeline.csv的一个副本。我将在一秒钟内添加导入语句。 - Axel

你还需要加上skiprows=2，否则移位后的log10会出现除以零的运行时错误。但这也可能是可以接受的，作为一个起点。我先把它放一下，几分钟后再来看。解决这个问题应该不会花费太长时间。 - DrM

我找到了高频成分的来源，请看一下。它是原始数据中尖峰的快速上升和下降。 - DrM

显示剩余3条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- DrM · Accepted Answer

这是一个干净的实现，我认为你试图完成的目标。我包括了图形输出和简要讨论其可能含义。

首先，我们使用rfft()，因为数据是实值的。这可以节省时间和精力（并降低生成冗余负频率所带来的错误率），否则需要生成多余的负频率。我们使用rfftfreq()生成频率列表（同样不需要手动编码，使用API可以降低错误率）。

对于你的数据，Tukey窗口比Hamming和类似的cos或sin基窗口函数更合适。请注意，在乘以窗口函数之前，我们会减去中位数。median()是一个相当健壮的基线估计方法，肯定比mean()更加健壮。

在图表中，您可以看到数据从其初始值迅速下降，然后结束。Hamming和类似的窗口函数将中间部分过于狭窄地采样，这样做会无谓地衰减大量有用的数据。

对于FT图，我们跳过零频率bin（第一个点），因为它只包含基线，并省略它提供了更方便的y轴缩放。

您会注意到在FT输出的图表中有一些高频成分。下面是一个示例代码，说明了这些高频成分的可能来源。

好的，这是代码：

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq
from scipy.signal import tukey

from numpy.fft import fft, fftshift
import pandas as pd

gtrend = pd.read_csv('multiTimeline.csv',index_col=0,skiprows=2)
#print(gtrend)

gtrend.index = pd.to_datetime(gtrend.index, format='%Y-%m')
#print(gtrend.index)

a_gtrend_orig = gtrend['diet: (Worldwide)']
t_gtrend_orig = np.linspace( 0, len(a_gtrend_orig)/12, len(a_gtrend_orig), endpoint=False )

a_gtrend_windowed = (a_gtrend_orig-np.median( a_gtrend_orig ))*tukey( len(a_gtrend_orig) )

plt.subplot( 2, 1, 1 )
plt.plot( t_gtrend_orig, a_gtrend_orig, label='raw data'  )
plt.plot( t_gtrend_orig, a_gtrend_windowed, label='windowed data' )
plt.xlabel( 'years' )
plt.legend()

a_gtrend_psd = abs(rfft( a_gtrend_orig ))
a_gtrend_psdtukey = abs(rfft( a_gtrend_windowed ) )

# Notice that we assert the delta-time here,
# It would be better to get it from the data.
a_gtrend_freqs = rfftfreq( len(a_gtrend_orig), d = 1./12. )

# For the PSD graph, we skip the first two points, this brings us more into a useful scale
# those points represent the baseline (or mean), and are usually not relevant to the analysis
plt.subplot( 2, 1, 2 )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psd[1:], label='psd raw data' )
plt.plot( a_gtrend_freqs[1:], a_gtrend_psdtukey[1:], label='windowed psd' )
plt.xlabel( 'frequency ($yr^{-1}$)' )
plt.legend()

plt.tight_layout()
plt.show()

这是图形化显示的输出结果。在1/年和0.14（恰好是1/14年的1/2）处有强烈的信号，还有一组高频信号，乍一看可能会非常神秘。

我们可以看到，窗函数实际上非常有效地将数据带到基线，并且您可以看到，在应用窗函数时，FT中的相对信号强度并没有被很大程度地改变。

如果您仔细观察数据，似乎在一年内有一些重复的变化。如果这些变化具有一定的规律性，它们可以预计会出现在FT中作为信号，实际上，FT中的信号存在与否常常用于区分信号和噪声。但正如将要展示的那样，高频信号有一个更好的解释。

好的，现在这里有一段示例代码，它演示了产生高频成分的一种方法。在这个代码中，我们创建一个单频信号，然后创建一组与该信号频率相同的尖峰。然后，我们对这两个信号进行傅里叶变换，最后绘制原始数据和傅里叶变换数据。

import matplotlib.pyplot as plt
import numpy as np
from numpy.fft import rfft, rfftfreq

t = np.linspace( 0, 1, 1000. )

y = np.cos( 50*3.14*t )

y2 = [ 1. if 1.-v < 0.01 else 0. for v in y ]

plt.subplot( 2, 1, 1 )
plt.plot( t, y, label='tone' )
plt.plot( t, y2, label='spikes' )
plt.xlabel('time')

plt.subplot( 2, 1, 2 )
plt.plot( rfftfreq(len(y),d=1/100.), abs( rfft(y) ), label='tone' )
plt.plot( rfftfreq(len(y2),d=1/100.), abs( rfft(y2) ), label='spikes' )
plt.xlabel('frequency')
plt.legend()

plt.tight_layout()
plt.show()

好的，这里是音调和尖峰的图形以及它们的傅里叶变换。请注意，尖峰产生的高频成分非常类似于我们数据中的那些。

换句话说，高频成分的起源很可能在原始数据中信号的尖峰特征所关联的短时间尺度内。