Python: Finding outliers in a list

Question

4

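I have a list containing a random number of ints and/or floats. My goal is to find the anomalies among my numbers (hopefully that is the right word for it). For example: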

list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

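About 90 to 99% of my values lie between 1 and 20.
Occasionally there are values far above that range, such as 100 or 1,000 or even more.

My problem is that these values can be different every time. Maybe the regular range is somewhere between 1,000 and 1,200, and the anomalies are around 500,000.

Is there a function that can filter out these special numbers?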

- finethen

1

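Something like computing the standard deviation? - DeepSpace

3

You are looking for "outliers". The hard part is defining what counts as an outlier. If most of your numbers follow some distribution, say a normal distribution, you can fit the data to that distribution and find the points that are unlikely to have come from it. - James

Does this answer your question? https://dev59.com/frXna4cB1Zd3GeqPH0Ig - Ronald

@James Thanks! Even knowing that they are called "outliers" helps my search. - finethen

3 Answers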

-1

You can use the built-in filter() function:

lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]

lst2 = list(filter(lambda x: x > 5,lst1))

print(lst2)

Output:

[14, 108, 8, 97]
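
Editorial sketch: the comments under this answer object to the hardcoded 5. One way to keep the filter() approach while deriving the limit from the data is to scale the median. The helper name split_by_median and the factor of 10 are assumptions for illustration, not part of the original answer, and the sketch assumes positive values.

```python
import statistics

def split_by_median(values, factor=10):
    """Split values into (regular, outliers) using a data-derived limit.

    `factor` is a hypothetical tuning knob; assumes the values are positive.
    """
    limit = factor * statistics.median(values)
    regular = list(filter(lambda x: x <= limit, values))
    outliers = list(filter(lambda x: x > limit, values))
    return regular, outliers

lst1 = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
regular, outliers = split_by_median(lst1)
print(outliers)  # values above 10 * median (median is 3 here)
```

Because the limit follows the median, the same call also works when the regular range sits around 1,000 and the anomalies around 500,000.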

- Ann Zen

1

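"My problem is that these values can be different every time. Maybe the regular range is somewhere between 1,000 and 1,200, and the anomalies are around 500,000." I think the idea here is not to hardcode 5 or any other value. - DeepSpace

@DeepSpace What do you mean by "regular range"? - Ann Zen

OP means the range of numbers that is considered acceptable. - DeepSpace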

-3

So here is a method for filtering out the values that deviate from the rest:

import math
_list = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
def consts(_list):
    # mean of the data
    mu = 0
    for i in _list:
        mu += i
    mu = mu/len(_list)
    # population standard deviation
    sigma = 0
    for i in _list:
        sigma += math.pow(i-mu,2)
    sigma = math.sqrt(sigma/len(_list))
    return sigma, mu

def frequence(x, sigma, mu):
    # normal probability density at x
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-(1/2)*math.pow(((x-mu)/sigma),2))

sigma, mu = consts(_list)

new_list = []
for i in range(len(_list)):
    # keep values whose density exceeds the (hardcoded) cut-off
    if frequence(_list[i], sigma, mu) > 0.01:
        new_list.append(_list[i])  # append the value, not its index
print(new_list)
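
Editorial sketch: the same mean and standard deviation idea can be written as a z-score filter with the standard library. The name z_filter and the cut-off max_z are assumptions for illustration, not part of the original answer. The sketch also shows a known weakness: large outliers inflate the standard deviation, so a strict cut-off can fail to flag them.

```python
import statistics

def z_filter(values, max_z=2.0):
    """Keep values within max_z population standard deviations of the mean.

    `max_z` is a hypothetical cut-off chosen for this example.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # all values are identical, nothing to reject
        return list(values)
    return [x for x in values if abs(x - mu) / sigma <= max_z]

data = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
print(z_filter(data, max_z=2.0))  # 97 and 108 are dropped
print(z_filter(data, max_z=3.0))  # nothing is dropped: the outliers inflate sigma
```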

- mama

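"My problem is that these values can be different every time. Maybe the regular range is somewhere between 1,000 and 1,200, and the anomalies are around 500,000." I think the idea here is not to hardcode 20 or any other value. Also, removing elements from a list while iterating over it is never a good idea (and the code you posted even raises an IndexError). - DeepSpace

OK, if you don't like it, I can write a function that detects the normal distribution and removes the data that does not fit it. - mama

@DeepSpace You are right! The idea is indeed not to have a hardcoded value like the 20 in this case. - finethen

Yes, if you build a normal distribution and cut off the parts that are not very normal, you are left with the largest and smallest values (the outliers), and then you only need to remove those. :) - mama

@mama I tried it, but unfortunately it did not work very well. The function dropped some values that should be within the accepted range. But thanks a lot for your help anyway! - finethen

Glad to help :) - mama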

- Ehsan · Accepted Answer

Assuming your list is l:

If you know that you want to filter at a certain percentile/quantile, you can use the following.

This removes the bottom 10% and the top 10% (everything at or above the 90th percentile). Of course, you can change either cut-off to whatever you need. In your example, for instance, you can drop the lower filter and only remove values above the 90th percentile:
```
import numpy as np
l = np.array(l)
# keep only values strictly between the 10th and 90th percentiles
l = l[(l > np.quantile(l, 0.1)) & (l < np.quantile(l, 0.9))].tolist()
print(l)
```
output:
```
[3, 2, 14, 2, 8, 4, 3, 5]
```
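
A minimal sketch of the one-sided variant mentioned above, assuming the example list from the question; the names arr, upper and kept are only for illustration:

```python
import numpy as np

l = [1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5]
arr = np.array(l)

# only cut off the top: keep everything below the 90th percentile
upper = np.quantile(arr, 0.9)
kept = arr[arr < upper].tolist()
print(kept)
```

Here the small values 1 are kept, which the two-sided filter would have removed.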
If you are not sure of the percentile cut-off and are looking to remove outliers:

The function below scales each value's distance from the median by the median absolute deviation. You can adjust the cut-off for outliers through the argument m in the function call: the larger it is, the fewer outliers are removed. This function seems to be more robust to various types of outliers than other outlier removal techniques.
```
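import numpy as np
l = np.array(l)
def reject_outliers(data, m=6.):
    d = np.abs(data - np.median(data))  # distance of each value from the median
    mdev = np.median(d)                 # median absolute deviation
    s = d / (mdev if mdev else 1.)      # scaled distance; guards against mdev == 0
    return data[s < m].tolist()
print(reject_outliers(l))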
```
output:
```
[1, 3, 2, 14, 2, 1, 8, 1, 4, 3, 5]
```
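
As a rough check against the other scenario from the question (regular range around 1,000 to 1,200, anomalies around 500,000), here is the same function applied to made-up data. The list shifted is a hypothetical example, not data from the question, and m=2 is just an illustrative stricter setting:

```python
import numpy as np

def reject_outliers(data, m=6.):
    # same logic as the function above, repeated so the sketch is self-contained
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m].tolist()

# hypothetical data: regular values near 1,000 to 1,200 plus one far outlier
shifted = np.array([1000, 1150, 1200, 1080, 500000, 1020, 1190, 1100])
print(reject_outliers(shifted))

# a smaller m is stricter: on the original list it also drops 14 and 8
original = np.array([1, 3, 2, 14, 108, 2, 1, 8, 97, 1, 4, 3, 5])
print(reject_outliers(original, m=2.))
```

No threshold is hardcoded in terms of the data's scale, so the default m handles both ranges.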