带有容差的numpy 1d数组阈值过滤器

4
假设我有一个 1 维的 numpy 数组,其中包含一些嘈杂的数据序列。
我想建立一个阈值来检查何时数值为高和低。但是由于数据存在噪声,所以仅仅这样做是没有意义的。
is_high = data > threshold

我试图给这个阈值设置一个公差,就像许多控制系统所做的那样(例如大多数暖气和空调系统)。其想法是当信号通过阈值加上公差时,状态仅从低变高。同样地,当信号低于阈值减去公差时,信号将仅从高变低。换句话说:

def tolerance_filter(data, threshold, tolerance):
    currently_high = False  # start low
    signal_state = np.empty_like(data, dtype=np.bool)
    for i in range(data.size):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state[i] = currently_high
    return signal_state

这个函数返回了我期望的输出。但是,我想知道是否有一种方法可以使用numpyscipy的速度来替代原始的Python for循环。有任何想法吗? :)
更新:
感谢Joe Kington的评论,他指向"hysteresis"这个术语,我找到了这个SO问题。恐怕它非常相似(重复?),并且那里还有Bas Swinckels的一个好的工作解决方案
无论如何,我尝试实现Joe Kington提出的加速建议(不知道我是否做对了),并将他的解决方案、Fergal的解决方案和Bas'与我的天真方法进行了比较。下面是结果(代码如下):
Proposed function in my original question
10 loops, best of 3: 22.6 ms per loop

Proposed function by Fergal
1000 loops, best of 3: 995 µs per loop

Proposed function by Bas Swinckels in the hysteresis question
1000 loops, best of 3: 1.05 ms per loop

Proposed function by Joe Kington using Cython
Approximate time cost of compiling: 2.195411
1000 loops, best of 3: 1.35 ms per loop

所有回答中的方法表现相似(尽管Fergal的需要一些额外步骤来获取布尔向量!)。这里有没有任何考虑要添加的内容?另外,我很惊讶Cython方法比较慢(尽管只是稍微慢了一点)。无论如何,我必须承认,如果您不熟悉所有numpy函数,那么它可能是编写最快的代码...

以下是我用于基准测试不同选项的代码。审计和修订都非常欢迎! :P (Cython代码在中间,以便SO保留所有代码在同一个可滚动的块中。当然,我将其放在一个不同的文件中)

# Naive approach from the original question
def tolerance_filter1(data, threshold, tolerance):
    currently_high = False  # start low
    signal_state = np.empty_like(data, dtype=np.bool)
    for i in range(data.size):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state[i] = currently_high
    return signal_state
        
# Numpythonic approach suggested by Fergal
def tolerance_filter2(data, threshold, tolerance):
    a = np.zeros_like(data)
    a[ data < threshold-tolerance] = -1
    a[ data > threshold+tolerance] = +1
    wh = np.where(a != 0)[0]
    idx= np.diff( a[wh]) == 2
    #This variable indexes the values of data where data crosses
    #from below threshold-tol to above threshold+tol
    crossesAboveThreshold = wh[idx]
    return crossesAboveThreshold
    
# Approach suggested by Bas Swinckels and borrowed
# from the hysteresis question
def tolerance_filter3(data, threshold, tolerance, initial=False):
    hi = data >= threshold+tolerance
    lo_or_hi = (data <= threshold-tolerance) | hi
    ind = np.nonzero(lo_or_hi)[0]
    if not ind.size: # prevent index error if ind is empty
        return np.zeros_like(x, dtype=bool) | initial
    cnt = np.cumsum(lo_or_hi) # from 0 to len(x)
    return np.where(cnt, hi[ind[cnt-1]], initial)
    
#########################################################
## IN A DIFFERENT FILE (tolerance_filter_cython.pyx)
## So that StackOverflow shows a single scrollable code block :)

import numpy as np
import cython

@cython.boundscheck(False)
def tolerance_filter(data, float threshold, float tolerance):
    cdef bint currently_high = 0  # start low
    signal_state = np.empty_like(data, dtype=int)
    cdef double[:] data_view = data
    cdef long[:] signal_state_view = signal_state
    cdef int i = 0
    cdef int l = len(data)
    low = np.empty_like(data, dtype=bool)
    high = np.empty_like(data, dtype=bool)
    low = data < (threshold-tolerance)
    high = data > (threshold+tolerance)
    
    for i in range(l):
        # if we were high and we are getting too low, become low
        if currently_high and low[i]:
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and high[i]:
            currently_high = True
        signal_state_view[i] = currently_high
    return signal_state

##################################################################
# BACK TO THE PYTHON FILE

import numpy as np
from time import clock
from datetime import datetime
from IPython import get_ipython
ipython = get_ipython()
time = np.arange(0,1000,0.01)
data = np.sin(time*3) + np.cos(time/7)*8 + np.random.normal(size=time.shape)*2
threshold, tolerance = 0, 4

print "Proposed function in my original question"
ipython.magic("timeit tolerance_filter1(data, threshold, tolerance)")

print "\nProposed function by Fergal"
ipython.magic("timeit tolerance_filter2(data, threshold, tolerance)")

print "\nProposed function by Bas Swinckels in the hysteresis question"
ipython.magic("timeit tolerance_filter3(data, threshold, tolerance)")

print "\nProposed function by Joe Kington using Cython"
start = datetime.now()
import pyximport; pyximport.install()
import tolerance_filter_cython
print "Approximate time cost of compiling: {}".format((datetime.now()-start).total_seconds())
tolerance_filter4 = tolerance_filter_cython.tolerance_filter
ipython.magic("timeit tolerance_filter4(data, threshold, tolerance)")

1
你是指某种滞后效应吗? - Ami Tavory
是的,滞后效应,就是它!谢谢,你明白这个术语了! :P - mgab
2个回答

2

我认为有时候惊讶的是,使用Cython扩展很容易像Python一样。以下是您的代码转换为Cython。这可以从Python中调用,但应该提供C++的速度。

def tolerance_filter(data, float threshold, float tolerance):
    cdef bint currently_high = 0  # start low
    signal_state = np.empty_like(data, dtype=int)
    cdef float[:] data_view = data
    cdef int[:] signal_state_view = signa_state
    cdef int i = 0
    cdef int l = len(data)
    for i in range(l):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state_view[i] = currently_high

有几点需要注意:

  • 请注意函数开头使用了类型化内存视图

  • 该函数被故意保持尽可能接近原始代码。然而,可以通过关闭范围检查(参见Cython文档)并在循环之外计算上下有效阈值来加速它。


是啊!Cython 是我一直想要找时间去玩的东西……但从未有机会……太好了!这节课加1分! :P 顺便问一下,当你提到“关闭范围检查”时,是指 @cython.boundscheck(False) 装饰器吗?我不确定我在搜索中是否成功了…… - mgab
是的,那绝对是装饰器。祝你好运! - Ami Tavory

0

我不确定这是否比你的解决方案更符合numpythonic,但它可能会更好。

a = np.zeros_like(data)
a[ data < threshold-tol] = -1
a[ data > threshold+tol] = +1
wh = np.where(a != 0)
idx= np.diff( a[wh]) == 2
#This variable indexes the values of data where data crosses
#from below threshold-tol to above threshold+tol
crossesAboveThreshold = wh[idx]

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接