How to calculate the distance to the previous zero in a pandas Series?

Question

3

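I have the following pandas Series (shown here as a list):

[7,2,0,3,4,2,5,0,3,4]

I want to define a new Series that gives the distance to the previous zero. That is, I would like the following output:

[1,2,0,1,2,3,4,0,1,2]

What is the most efficient way to do this in pandas?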

- Roman

8 Answers

4

The solution in pandas is a little tricky, but it could look like this (where s is your Series):

>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64

The last step uses the "itertools.groupby" recipe from the pandas cookbook.
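
The three lines above are dense, so here is a sketch that names each intermediate and prints them side by side. The variable `run_id` is introduced only for illustration. One caveat worth noting: the first row is compared against NaN, so a Series that begins with a zero would get 1 in the first position with this approach.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# Step 1: running count of non-zero values; it stays flat at each zero.
x = (s != 0).cumsum()

# Step 2: True where the count increased (the first row compares
# against NaN, so it is always True).
y = x != x.shift()

# Step 3: label each run of consecutive equal values in y.
run_id = (y != y.shift()).cumsum()

# Step 4: cumulative sum of y within each run; runs of False stay at 0.
result = y.groupby(run_id).cumsum()

print(pd.DataFrame({"s": s, "x": x, "y": y, "run_id": run_id, "result": result}))
```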

- Alex Riley

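I appreciate the elegance of this, but it takes many passes and "groupby" operations for something that can easily be done in a single pass in a Cython extension, so it is somewhat wasteful. - Ami Tavory

I agree - if performance matters, it is best to implement this kind of thing in Cython. It can also be done in pandas (as the cookbook shows), which is handy if Cython is not an available option. - Alex Riley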

2

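A solution that may be less performant than the others (I have not checked carefully) but whose steps are easier to follow (at least for me):


import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df

df['flag'] = np.where(df['X'] == 0, 0, 1)   # 1 for non-zero rows, 0 at zeros
df['cumsum'] = df['flag'].cumsum()          # running count of non-zero rows
# keep the running count only at the zeros; NaN elsewhere
df['offset'] = df['cumsum'].where(df['flag'] == 0)
# carry the count at the last zero forward; rows before the first zero get 0
df['offset'] = df['offset'].ffill().fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']

df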

- Partha Mandal

1

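It is sometimes surprising how easily you can get C-like speed with Cython. Assuming your column's .values gives arr, then:

import numpy as np

# arr must be a 1-D array of C ints, e.g. s.values.astype(np.intc)
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret

cdef int i, zero_count = 0
for i in range(len(ret)):
    # reset the counter at a zero, otherwise extend the current run
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count

Note the use of typed memoryviews, which are very fast. You can speed it up further by decorating the function that uses this with @cython.boundscheck(False).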
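
For readers without a Cython toolchain, the same single-pass counter can be sketched in plain Python. It will be much slower than the compiled version, but it is handy as a reference for checking the vectorized answers. The function name `dist_to_prev_zero` is made up for this illustration.

```python
import numpy as np

def dist_to_prev_zero(arr):
    """Plain-Python version of the single-pass counter (hypothetical helper)."""
    ret = np.zeros_like(arr)
    zero_count = 0
    for i in range(len(arr)):
        # Reset at a zero, otherwise extend the current run.
        zero_count = 0 if arr[i] == 0 else zero_count + 1
        ret[i] = zero_count
    return ret

arr = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
print(dist_to_prev_zero(arr))
```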

- Ami Tavory

0

Perhaps pandas is not the best tool for this, as @behzad.nouri answered, but here is another variation:

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()

0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
Name: X, dtype: int64

Solution 2:

If you write the following code, you get almost everything you need, except that the first row starts from 0 instead of 1:

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()

0    0
1    1
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64

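This is because the cumulative count starts from 0. To get the desired result, I prepended a row containing 0, ran all the calculations, and then dropped that row at the end:

x = pd.DataFrame({'X': [0]})                  # extra leading zero row
df = pd.concat([x, df], ignore_index=True)
df.eq(0).cumsum().groupby('X').cumcount().drop(0).reset_index(drop=True)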

0    1
1    2
2    0
3    1
4    2
5    3
6    4
7    0
8    1
9    2
dtype: int64
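
As an alternative to prepending and dropping a row, the off-by-one can be corrected directly: only the rows before the first zero need to be shifted by one. This is a sketch of that variation, not the answer's own code; the names `group`, `count`, and `result` are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# Group id: number of zeros seen so far (0 means "before the first zero").
group = s.eq(0).cumsum()

# Position within each group, starting at 0.
count = s.groupby(group).cumcount()

# Rows before the first zero have no zero at offset 0, so shift them by one.
result = count + group.eq(0).astype(int)

print(result.tolist())
```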

- ali bakhtiari

0

Using NumPy accumulate: here is another approach. The only caveat is that, to initialize the counter at zero, you need to insert a zero before the series values.

import numpy as np

# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1

# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)

# Apply recursively over series values (frompyfunc ufuncs work on object arrays)
x = npf.accumulate(np.r_[0, s.values], dtype=object)[1:]

print(x)

[1 2 0 1 2 3 4 0 1 2]
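
The same running-counter idea can be sketched with the standard library alone, using `itertools.accumulate`. Its `initial` argument (Python 3.8+) plays the role of the zero inserted before the values above. The names `values` and `step` are illustrative only.

```python
from itertools import accumulate

values = [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]

def step(count, value):
    # Reset at a zero, otherwise increment the running count.
    return 0 if value == 0 else count + 1

# initial=0 seeds the counter; the seed is emitted first, so drop it.
result = list(accumulate(values, step, initial=0))[1:]
print(result)
```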

- Bill

0

Here is an approach that does not use groupby:

((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())

Output:

[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]

- rhug123

0

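Another option

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
# positions of the zeros, with -1 standing in for a zero just before the start
zeros = np.r_[-1, np.where(df.X == 0)[0]]

def d0(a):
    # smallest non-negative distance, i.e. distance to the closest earlier zero
    return np.min(a[a>=0])

df.index.to_series().apply(lambda i: d0(i - zeros))

Or using pure NumPy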

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]

np.min(a, where=a>=0, axis=1, initial=len(df))
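
To make the broadcasting step concrete, the sketch below prints the first rows of the difference matrix and then reduces it with a masked minimum. It uses `np.where` in place of the `where=` argument purely so the mask is visible as an array. Note that the matrix has one column per zero plus one, so memory grows with both the series length and the number of zeros.

```python
import numpy as np

x = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
zeros = np.r_[-1, np.flatnonzero(x == 0)]   # positions of zeros, plus a virtual one at -1

# a[i, j] = i - zeros[j]: signed distance from row i to the j-th zero position.
a = np.arange(len(x))[:, None] - zeros[None, :]
print(a[:4])

# Replace distances to zeros that lie ahead (negative) with a large sentinel,
# then take the minimum of each row.
masked = np.where(a >= 0, a, len(x))
result = masked.min(axis=1)
print(result)
```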

- dimid


- behzad.nouri · Accepted Answer

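The complexity is O(n). What makes it slow is looping in Python. If the series contains k zeros, and log k is negligible compared with the length of the series, an O(n log k) solution would be:

>>> izero = np.r_[-1, (ts.values == 0).nonzero()[0]]  # indices of zeros (via .values; Series.nonzero was removed from pandas)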
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
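
A different vectorized route, not the one used in this answer, avoids the binary search by taking a running maximum over the zero positions, which keeps the work linear in the series length. This is a minimal sketch under that assumption; the function name `dist_to_prev_zero` is hypothetical.

```python
import numpy as np

def dist_to_prev_zero(values):
    """Distance to the most recent zero, counting a virtual zero at index -1."""
    values = np.asarray(values)
    idx = np.arange(len(values))
    # Index of each zero, or -1 elsewhere (-1 plays the role of the virtual zero).
    marks = np.where(values == 0, idx, -1)
    # Running maximum = index of the most recent zero at or before each position.
    last_zero = np.maximum.accumulate(marks)
    return idx - last_zero

print(dist_to_prev_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4]))
```

With a pandas Series `ts`, it could be called as `dist_to_prev_zero(ts.to_numpy())`.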