Moving-average outlier detection in Python

Question

3

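I am trying to translate an algorithm from MATLAB to Python. The algorithm works on large datasets and needs an outlier detection and elimination step.

In the MATLAB code, the outlier removal technique I use is movmedian: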

   Outlier_T=isoutlier(Data_raw.Temperatura,'movmedian',3);
   Data_raw(find(Outlier_T),:)=[]

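This detects outliers with a moving median: it looks for values that are out of proportion to the median of a three-value window centered on each point. So if I have a column called "Temperatura" with a 40 in row 3, that value is detected and the whole row is deleted.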

         Temperatura     Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        40           7.0        
    4        25.31        9.3        
    5        26.21       15.6
    6        26.59       17.9        
   ...        ...         ...

As far as I understand, this can be done with pandas.DataFrame.rolling. I have seen some posts showing how to use it, but I can't get it to work in my code:

Attempt A:

Dataframe.rolling(df["t_new"]))

Attempt B:

df-df.rolling(3).median().abs()>200

# based on @Ami Tavory's answer

Am I missing something obvious? What is the right way to do this? Thanks for your time.

- enricw

1

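There is a typo. Try replacing "meadian" with "median". - Nilesh Ingle

Sorry, I did write that, but the typo is not in the code. - enricw

OK, thanks. I have posted an answer below that uses a rolling median. - Nilesh Ingle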
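
A note on Attempt B: method calls bind tighter than subtraction, so `.abs()` there applies only to the rolling median and not to the difference. A minimal sketch with explicit parentheses follows. It assumes a single numeric column, and the threshold of 10 is a made-up value for illustration rather than the 200 from the attempt.

```python
import pandas as pd

# Sample values taken from the question's table
df = pd.DataFrame({"Temperatura": [24.72, 25.76, 40, 25.31, 26.21, 26.59]})
threshold = 10  # assumed value, chosen only for this example

# Parentheses make abs() apply to the difference, not just to the median
deviation = (df["Temperatura"] - df["Temperatura"].rolling(3).median()).abs()

# The first two deviations are NaN (incomplete window) and compare as False
is_outlier = deviation > threshold

df_clean = df[~is_outlier]
print(df_clean)
```

The rows with an incomplete window are kept here because a comparison against NaN evaluates to False.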

3 Answers

2

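Late to the party; this builds on Nilesh Ingle's answer. It is modified to be more general and more verbose (graphs!), and it uses a percentage threshold instead of the actual data values.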

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Calculate rolling median
df["Temp_Rolling"] = df["Temp"].rolling(window=3).median()

# Scale the raw values to [0, 1]
scaler = MinMaxScaler()
df["Temp_Scaled"] = scaler.fit_transform(df["Temp"].values.reshape(-1, 1))

# Reuse the fitted scaler so the rolling median is on the same scale
df["Temp_Rolling"] = scaler.transform(df["Temp_Rolling"].values.reshape(-1, 1))

# Calculate difference
df["Temp_Diff"] = df["Temp_Scaled"] - df["Temp_Rolling"]

# Set threshold for difference with rolling median
upper_threshold = 0.4
lower_threshold = -0.4

# Flag rows to be kept as True
df["Temp_Keep_Flag"] = np.where( (df["Temp_Diff"] > upper_threshold) | (df["Temp_Diff"] < lower_threshold), False, True)

# Show the dropped rows and plot the kept ones
print('dropped rows')
print(df[~df["Temp_Keep_Flag"]].index)
print('Your new graph')
df_result = df[df["Temp_Keep_Flag"].values]
df_result["Temp"].plot()
plt.show()

Once you are satisfied with the data cleaning:

# Satisfied, replace data (chained to avoid an in-place drop on a filtered slice)
df = df[df["Temp_Keep_Flag"].values].drop(columns=["Temp_Rolling", "Temp_Diff", "Temp_Keep_Flag"])
df.plot()

- Julius Chai
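
If the raw values and the rolling median share one min-max scale, the scaled difference reduces to (Temp - rolling median) / (max - min). The sketch below computes that fraction directly, without scikit-learn. The sample values come from the question, and the name `max_fraction` and its value of 0.4 are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Temp": [24.72, 25.76, 40, 25.31, 26.21, 26.59]})

# Assumed threshold: a deviation larger than 40% of the observed range
max_fraction = 0.4

rolling_median = df["Temp"].rolling(window=3).median()
value_range = df["Temp"].max() - df["Temp"].min()

# Deviation from the rolling median as a fraction of the data range
relative_diff = (df["Temp"] - rolling_median) / value_range

# NaN (the first two rows) compares as False, so those rows are kept
keep = ~(relative_diff.abs() > max_fraction)
df_result = df[keep]
print(df_result)
```

Because the threshold is a fraction of the range, a single extreme value widens the range and makes the test less sensitive, so the fraction may need tuning per dataset.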

1

Julius, I like your comment. But for clarity, you never define df_scaled in your solution. - Caleb Sprague

1

Nilesh's answer solves the problem perfectly. If you want to iterate on his code, you can also do this:

upper_threshold = 1
lower_threshold = -1

# Calculate rolling median
df['rolling_temp'] = df['Temp'].rolling(window=3).median()
# all in one line 
df = df.drop(df[(df['Temp'] - df['rolling_temp'] > upper_threshold) | (df['Temp'] - df['rolling_temp'] < lower_threshold)].index)
# if you want to drop the column as well
del df["rolling_temp"]

- el_bobo
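
One detail worth checking when porting from MATLAB: `rolling(window=3)` in pandas uses a trailing window by default (the current row and the two before it), whereas the question describes a window centered on each value. The sketch below compares the two alignments on the question's sample values. Using `min_periods=1` so that the edges fall back to shorter windows is an assumption about the desired edge behavior.

```python
import pandas as pd

temps = pd.Series([24.72, 25.76, 40, 25.31, 26.21, 26.59])

# Default: trailing window over rows i-2..i; the first two results are NaN
trailing = temps.rolling(window=3).median()

# Centered window over rows i-1..i+1; min_periods=1 lets the edges use shorter windows
centered = temps.rolling(window=3, center=True, min_periods=1).median()

comparison = pd.DataFrame({"temp": temps, "trailing": trailing, "centered": centered})
print(comparison)
```

With a trailing window, an outlier keeps influencing the median of the two rows after it, which can shift which neighbors get flagged when the threshold is tight.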

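Thanks for your reply. I couldn't get Nilesh's code to work the way you did. I have edited the original post and attached the error message. Do you have any ideas? - enricw

Not sure, it could be a lot of things. Can you try my version? It could also be an indexing issue. Where is the value 445 in your df? - el_bobo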


- Nilesh Ingle · Accepted Answer

The code below drops rows based on a threshold, which can be adjusted as needed. I am not sure whether it reproduces the MATLAB code exactly.

# Import Libraries
import pandas as pd
import numpy as np

# Create DataFrame
df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date':[2.3,4.6,7.0,9.3,15.6,17.9]
})

# Set threshold for difference with rolling median
upper_threshold = 1
lower_threshold = -1

# Calculate rolling median
df['rolling_temp'] = df['Temperatura'].rolling(window=3).median()

# Calculate difference
df['diff'] = df['Temperatura'] - df['rolling_temp']

# Flag rows to be dropped as `1`
df['drop_flag'] = np.where((df['diff']>upper_threshold)|(df['diff']<lower_threshold),1,0)

# Drop flagged rows
df = df[df['drop_flag']!=1]
df = df.drop(['rolling_temp', 'diff', 'drop_flag'], axis=1)

Output

print(df)

   Temperatura  Date
0        24.72   2.3
1        25.76   4.6
3        25.31   9.3
4        26.21  15.6
5        26.59  17.9
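
For a result closer to MATLAB, the fixed threshold can be swapped for a local one. As documented, `isoutlier` with 'movmedian' flags elements more than three local scaled MAD (median absolute deviation) from the local median over the window. The sketch below is an approximation of that rule in pandas and has not been checked against MATLAB output. The helper names `scaled_mad` and `movmedian_outliers`, the rounded constant 1.4826, and the shortened windows at the edges are assumptions.

```python
import numpy as np
import pandas as pd

# Approximate factor that makes the MAD comparable to a standard deviation
SCALE = 1.4826

def scaled_mad(window_values):
    """Scaled median absolute deviation of one window (a NumPy array)."""
    med = np.median(window_values)
    return SCALE * np.median(np.abs(window_values - med))

def movmedian_outliers(series, window=3, n_mads=3.0):
    """Flag values more than n_mads local scaled MADs from the local median."""
    roll = series.rolling(window=window, center=True, min_periods=1)
    local_median = roll.median()
    local_mad = roll.apply(scaled_mad, raw=True)
    return (series - local_median).abs() > n_mads * local_mad

df = pd.DataFrame({
    'Temperatura': [24.72, 25.76, 40, 25.31, 26.21, 26.59],
    'Date': [2.3, 4.6, 7.0, 9.3, 15.6, 17.9]
})

mask = movmedian_outliers(df['Temperatura'])
df_clean = df[~mask]
print(df_clean)
```

Unlike a fixed threshold of 1, this rule adapts to the local spread, so it needs no tuning when the units or the noise level change. With a window of only three values the local MAD is often very small, so a wider window may be more robust on real data.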