Counting consecutive runs of non-zero values efficiently

Question

4

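I am working with rainfall time series and want to compute the length and volume of individual rain events, where an "event" is a sequence of non-zero time steps. I am dealing with multiple time series of about 60k time steps each, and my current approach is quite slow.

My current approach is as follows: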

import numpy as np

def count_events(timeseries):
    start = 0  
    end = 0
    lengths = []
    volumes = []
    # pad a 0 at the edges so as to include edges as "events"
    for i, val in enumerate(np.pad(timeseries, pad_width = 1, mode = 'constant')):

        if val > 0 and start==0:
            start = i
        if val == 0 and start>0:
            end = i

            if end - start != 1:
                volumes.append(np.sum(timeseries[start:end]))
            elif end - start == 1:
                volumes.append(timeseries[start-1])

            lengths.append(end-start)
            start = 0

    return np.asarray(lengths), np.asarray(volumes)

Expected output:

testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])
lengths, volumes = count_events(testrain)
print lengths
[1 1 4 1 1]
print volumes
[  1.    1.   12.    0.1   1. ] # 12 should actually be 14, my code returns wrong results.
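The 12 instead of 14 appears to come from an index offset: the loop index runs over the padded array, which is shifted by one relative to `timeseries`, so `timeseries[start:end]` sums 2, 8, 2, 0 rather than 2, 2, 8, 2. The single-step branch already compensates with `start-1`, which is why only longer events are affected. A minimal sketch of a corrected loop, assuming the same definition of an event (the name `count_events_fixed` is made up for illustration):

```python
import numpy as np

def count_events_fixed(timeseries):
    # Hypothetical corrected variant of count_events; not a speed-up.
    timeseries = np.asarray(timeseries, dtype=float)
    lengths, volumes = [], []
    start = None  # index in the unpadded series where the current event began
    # Append one trailing zero so an event that runs to the end gets closed.
    for i, val in enumerate(np.append(timeseries, 0.0)):
        if val > 0 and start is None:
            start = i
        elif val == 0 and start is not None:
            lengths.append(i - start)
            # start and i both index the unpadded series, so no offset is needed.
            volumes.append(timeseries[start:i].sum())
            start = None
    return np.asarray(lengths), np.asarray(volumes)
```

Padding only at the end keeps the loop index aligned with `timeseries`, and using `None` as the "no open event" marker lets an event start at index 0.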

I imagine there is a better way to do this that takes advantage of numpy's efficiency, but nothing comes to mind so far...

Edit:

Comparing the different solutions:

testrain = np.random.normal(10,5, 60000)
testrain[testrain<0] = 0

My solution (produces wrong results, not sure why):

%timeit count_events(testrain)
#10 loops, best of 3: 129 ms per loop

@dawg's:

%timeit dawg(testrain) # using itertools
#10 loops, best of 3: 113 ms per loop
%timeit dawg2(testrain) # using pure numpy
#10 loops, best of 3: 156 ms per loop

@DSM's:

%timeit DSM(testrain)
#10 loops, best of 3: 28.4 ms per loop

@DanielLenz's:

%timeit DanielLenz(testrain)
#10 loops, best of 3: 316 ms per loop

- areuexperienced

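You can use np.diff and np.where for this: diff finds the points where the series changes. - user707650

@Divakar Sorry, edited. - areuexperienced

Can you explain the logic behind your output? How do you want to obtain the "lengths" and "volumes"? - Mazdak

Counting through the time series, the length of a single "event" (measured in number of time steps) is the difference between the index at which a non-zero value is encountered (the storm starts) and the index at which a zero value is encountered (the storm ends). The volume is the sum of the actual values between those indices. - areuexperienced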

Are you sure you want a volume of 12 for the third group? Shouldn't it be 14? - DSM
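The np.diff and np.where suggestion from the comments could be sketched roughly as follows. This is one possible reading of that hint, not code posted by the commenter; the function name `count_events_diff` and the use of a cumulative sum for the volumes are assumptions made for illustration.

```python
import numpy as np

def count_events_diff(timeseries):
    # Hypothetical helper: lengths and volumes of runs of non-zero values.
    timeseries = np.asarray(timeseries, dtype=float)
    # Pad with False so runs touching either edge still get a start and an end.
    wet = np.concatenate(([False], timeseries != 0, [False]))
    # +1 where a run starts, -1 one step past where it ends.
    edges = np.diff(wet.astype(np.int8))
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    lengths = ends - starts
    # Prefix sums turn each run's volume into a difference of two entries.
    csum = np.concatenate(([0.0], np.cumsum(timeseries)))
    volumes = csum[ends] - csum[starts]
    return lengths, volumes
```

Because the volumes are differences of cumulative sums, they can differ from a direct sum by floating-point rounding, so compare them with a tolerance rather than with `==`.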

3 Answers

4

Here is a groupby solution:

import numpy as np
from itertools import groupby

testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])

lengths=[]
volumes=[]
for k, l in groupby(testrain, key=lambda v: v>0):
    if k:
        li=list(l)
        lengths.append(len(li))
        volumes.append(sum(li))

print lengths     
print volumes

Prints

[1, 1, 4, 1, 1]
[1.0, 1.0, 14.0, 0.10000000000000001, 1.0]

If you want a pure numpy solution:

def find_runs(arr):
    # split at every zero, so each piece holds at most one event
    subs=np.split(arr, np.where(arr== 0.)[0])
    # drop the zero that leads each piece
    arrs=[np.delete(sub, np.where(sub==0.)) for sub in subs]
    return [(len(e), sum(e)) for e in arrs if len(e)]

>>> find_runs(testrain)    
[(1, 1.0), (1, 1.0), (4, 14.0), (1, 0.10000000000000001), (1, 1.0)]
>>> length, volume=zip(*find_runs(testrain))

- dawg

Nice. Out of interest, I'll leave this open for now to see whether a numpy-based solution turns up. - areuexperienced

1

Here is my approach, using label from scipy.ndimage.measurements:

import numpy as np
from scipy.ndimage.measurements import label

testrain = np.array([1,0,1,0,2,2,8,2,0,0,0.1,0,0,1])
labels, nlabels = label(testrain)
labels
>> array([1, 0, 2, 0, 3, 3, 3, 3, 0, 0, 4, 0, 0, 5], dtype=int32)

def sum_and_length(n):
    obj = np.array(testrain[labels==n])
    return [np.sum(obj), obj.size]

sums, lengths = np.array(list(map(sum_and_length, range(1, nlabels+1)))).T
sums
>> array([  1. ,   1. ,  14. ,   0.1,   1. ])
lengths
>> array([ 1.,  1.,  4.,  1.,  1.])

This is not the best solution, since the problem is a very good fit for pandas, but it may give you an idea of how powerful the measurements toolbox is.
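The per-label Python loop is likely what makes this approach slow in the timings above. As a rough sketch, the same labels can be reduced without that loop by using `scipy.ndimage.sum` together with `np.bincount`; the function name `events_from_labels` is hypothetical, and importing from the top-level `scipy.ndimage` namespace is an assumption about the installed SciPy version.

```python
import numpy as np
from scipy import ndimage

def events_from_labels(timeseries):
    # Hypothetical helper built on the labelling idea above.
    timeseries = np.asarray(timeseries, dtype=float)
    # Consecutive non-zero steps share one integer label; zeros are labelled 0.
    labels, nlabels = ndimage.label(timeseries)
    index = np.arange(1, nlabels + 1)
    # Sum the rainfall under each label in a single call.
    volumes = ndimage.sum(timeseries, labels, index=index)
    # Count the steps carrying each label, skipping the background label 0.
    lengths = np.bincount(labels, minlength=nlabels + 1)[1:]
    return lengths, volumes
```

Both reductions run in compiled code, so the cost no longer grows with one boolean mask per event.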

- Daniel Lenz

- DSM · Accepted Answer

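While you can do this in pure numpy, you are really applying numpy to a pandas problem. Your volume is the result of a groupby operation, which you can emulate in numpy but which pandas supports natively.

For example: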

>>> import pandas as pd
>>> tr = pd.Series(testrain)
>>> nonzero = (tr != 0)
>>> group_ids = (nonzero & (nonzero != nonzero.shift())).cumsum()
>>> events = tr[nonzero].groupby(group_ids).agg([sum, len])
>>> events
    sum  len
1   1.0    1
2   1.0    1
3  14.0    4
4   0.1    1
5   1.0    1
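For comparison, here is a sketch of how the same grouping might be emulated in plain numpy, as the answer says is possible. It mirrors the `group_ids` construction above but swaps the pandas groupby for `np.bincount`; the function name `events_bincount` is made up for illustration.

```python
import numpy as np

def events_bincount(timeseries):
    # Hypothetical numpy counterpart of the pandas groupby above.
    timeseries = np.asarray(timeseries, dtype=float)
    nonzero = timeseries != 0
    # Shift the mask right by one step, so prev[i] says whether step i-1 was wet.
    prev = np.concatenate(([False], nonzero))[:-1]
    # True exactly where a non-zero value follows a zero or opens the series.
    is_start = nonzero & ~prev
    # A running count of starts gives every step of event k the id k; zeros get 0.
    group_ids = np.cumsum(is_start) * nonzero
    n_events = int(is_start.sum())
    # Bin 0 collects the dry steps and is dropped.
    lengths = np.bincount(group_ids, minlength=n_events + 1)[1:]
    volumes = np.bincount(group_ids, weights=timeseries, minlength=n_events + 1)[1:]
    return lengths, volumes
```

For the `testrain` example the ids come out as 1, 0, 2, 0, 3, 3, 3, 3, 0, 0, 4, 0, 0, 5, which matches the index of the `events` table.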