One approach with cumsum and boolean-indexing would be -
arr[np.isnan(arr).cumsum(1)>0] = np.nan
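For better performance, use np.maximum.accumulate, which yields the boolean mask directly, without the intermediate integer counts and the > 0 comparison -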
arr[np.maximum.accumulate(np.isnan(arr),axis=1)] = np.nan
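To see why the two one-liners select the same elements, here is a minimal sketch that builds both masks on a small array. The name `a` and its values are only illustrative.

```python
import numpy as np

# Small illustrative input; the values are arbitrary.
a = np.array([[3., 5., np.nan, 2., 4.],
              [9., 1., 3., 5., 1.],
              [8., np.nan, 3., np.nan, 7.]])

isnan = np.isnan(a)

# Running count of NaNs per row; positive from the first NaN onwards.
mask_cumsum = isnan.cumsum(axis=1) > 0

# Running maximum of the boolean mask; stays True after the first True.
mask_maxacc = np.maximum.accumulate(isnan, axis=1)

print(mask_maxacc)
print(np.array_equal(mask_cumsum, mask_maxacc))
```

Both masks are True at the first NaN of a row and everywhere to its right, so indexing with either one overwrites the same positions.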
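Another approach makes a slightly twisted use of broadcasting: find the column of the first NaN in each row, then compare it against the column indices -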
n = arr.shape[1]
mask = np.isnan(arr)
idx = mask.argmax(1)
idx[~mask.any(1)] = n
arr[idx[:,None] <= np.arange(n)] = np.nan
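The intermediate values in the broadcasting version can be inspected with a short sketch like the one below. The array `a` and the name `fill` are illustrative and not part of the original snippet.

```python
import numpy as np

# Small illustrative input; the values are arbitrary.
a = np.array([[3., 5., np.nan, 2., 4.],
              [9., 1., 3., 5., 1.],
              [8., np.nan, 3., np.nan, 7.]])

n = a.shape[1]
mask = np.isnan(a)

# Column of the first NaN in each row; argmax returns 0 when a row has none.
idx = mask.argmax(1)

# Rows without any NaN get an index past the last column, so nothing matches.
idx[~mask.any(1)] = n

# (rows, 1) compared with (n,) broadcasts to a (rows, n) boolean mask.
fill = idx[:, None] <= np.arange(n)

print(idx)
print(fill)
```

The `idx[~mask.any(1)] = n` step matters because `argmax` cannot distinguish "first NaN at column 0" from "no NaN at all"; without it, NaN-free rows would be wiped entirely.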
Sample run -
In [96]: arr
Out[96]:
array([[  3.,   5.,  nan,   2.,   4.],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,   3.,  nan,   7.]])
In [97]: arr[np.maximum.accumulate(np.isnan(arr),axis=1)] = np.nan
In [98]: arr
Out[98]:
array([[  3.,   5.,  nan,  nan,  nan],
       [  9.,   1.,   3.,   5.,   1.],
       [  8.,  nan,  nan,  nan,  nan]])
Benchmarking
Approaches -
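import numpy as np

def func1(arr):
    # cumsum of the NaN mask is > 0 from the first NaN onwards in each row
    arr[np.isnan(arr).cumsum(1) > 0] = np.nan

def func2(arr):
    # running maximum of the boolean mask stays True after the first NaN
    arr[np.maximum.accumulate(np.isnan(arr), axis=1)] = np.nan

def func3(arr):
    # like func1, but accumulates in place into the boolean mask
    mask = np.isnan(arr)
    accmask = np.cumsum(mask, out=mask, axis=1)
    arr[accmask] = np.nan

def func4(arr):
    # like func2, but accumulates in place into the boolean mask
    mask = np.isnan(arr)
    np.maximum.accumulate(mask, axis=1, out=mask)
    arr[mask] = np.nan

def func5(arr):
    # broadcasting: compare each row's first-NaN column with the column indices
    n = arr.shape[1]
    mask = np.isnan(arr)
    idx = mask.argmax(1)
    idx[~mask.any(1)] = n  # rows without NaN get an index past the last column
    arr[idx[:,None] <= np.arange(n)] = np.nan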
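Before timing in-place functions like these, it can be worth checking that they agree on the same input. The following is a minimal sketch that compares two of the approaches on copies of one random array. The helper `run_on_copy`, the array size, and the seed are assumptions made for illustration.

```python
import numpy as np

def func2(arr):
    arr[np.maximum.accumulate(np.isnan(arr), axis=1)] = np.nan

def func5(arr):
    n = arr.shape[1]
    mask = np.isnan(arr)
    idx = mask.argmax(1)
    idx[~mask.any(1)] = n
    arr[idx[:, None] <= np.arange(n)] = np.nan

def run_on_copy(func, arr):
    # Hypothetical helper: apply an in-place function to a copy and return it.
    out = arr.copy()
    func(out)
    return out

# Small random input with a few NaNs scattered in it (size chosen arbitrarily).
rng = np.random.default_rng(0)
base = rng.random((50, 40))
base.ravel()[rng.choice(base.size, 30, replace=False)] = np.nan

res2 = run_on_copy(func2, base)
res5 = run_on_copy(func5, base)

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(res2, res5)
```

The same helper could be used with the other functions; the comparison raises an AssertionError if any pair disagrees.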
Timings -
In [201]:
     ...: arr = np.random.rand(5000,5000)
     ...: arr.ravel()[np.random.choice(range(arr.size), 10000, replace=0)] = np.nan
     ...: arr1 = arr.copy()
     ...: arr2 = arr.copy()
     ...: arr3 = arr.copy()
     ...: arr4 = arr.copy()
     ...: arr5 = arr.copy()
     ...:
In [202]: %timeit func1(arr1)
     ...: %timeit func2(arr2)
     ...: %timeit func3(arr3)
     ...: %timeit func4(arr4)
     ...: %timeit func5(arr5)
     ...:
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 90.5 ms per loop
10 loops, best of 3: 88.8 ms per loop
10 loops, best of 3: 88.5 ms per loop
10 loops, best of 3: 75.3 ms per loop
The broadcasting-based func5 seems to do pretty well, with the best time of the five here!
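Because these functions modify their argument in place, repeated timing runs on the same array see input that earlier runs have already filled with NaNs. One possible way to time each run on untouched data is sketched below. The helper `best_time` and the small array size are assumptions for illustration, and the copy is deliberately kept outside the timed region.

```python
import time
import numpy as np

def func2(arr):
    arr[np.maximum.accumulate(np.isnan(arr), axis=1)] = np.nan

def best_time(func, base, repeat=3):
    # Hypothetical helper: time func on a fresh copy of base for each run
    # and return the fastest run in seconds.
    times = []
    for _ in range(repeat):
        work = base.copy()            # copy is made before the clock starts
        start = time.perf_counter()
        func(work)
        times.append(time.perf_counter() - start)
    return min(times)

# Small input so the sketch runs quickly; the size is arbitrary.
rng = np.random.default_rng(0)
base = rng.random((200, 200))
base.ravel()[rng.choice(base.size, 100, replace=False)] = np.nan

elapsed = best_time(func2, base)
print(f"func2: {elapsed * 1e3:.3f} ms")
```

Absolute numbers depend on the machine and NumPy version, so any comparison is only meaningful between functions timed in the same session.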
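mask = np.isnan(arr); accmask = np.cumsum(mask, out=mask, axis=1); arr[accmask] = np.nan would be faster (and probably more memory-efficient) :-) - MSeifert
func5 runs slower than func2, func3 and func4. But given the fierce competition, I added a Numba solution that is 1.5 times faster than all of them :) - MSeifert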