Most efficient way to forward-fill NaN values in a numpy array

Question

80

Example Problem

As a simple example, consider the numpy array arr defined below:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

In console output, arr looks like this:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

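I would now like to row-wise "forward-fill" the nan values in arr, that is, replace each nan value with the nearest valid value to its left. The desired result looks like this: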

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])

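What I've Tried So Far

I've tried using for-loops:

for row_idx in range(arr.shape[0]):
    # start at column 1: column 0 has no left neighbour (index -1 would wrap to the last column)
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]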

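I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

import pandas as pd
df = pd.DataFrame(arr)
# DataFrame.ffill replaces the deprecated fillna(method='ffill')
df.ffill(axis=1, inplace=True)
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
arr = df.to_numpy()

Both of the above strategies produce the desired result, but I keep wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?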

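Summary

Is there another, more efficient way to "forward-fill" nan values in numpy arrays (e.g. by using numpy vectorized operations)?

Update: Solutions Comparison

I've tried to time all solutions proposed so far. This was my setup script: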

import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

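def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # DataFrame.ffill replaces the deprecated fillna(method='ffill')
    df.ffill(axis=1, inplace=True)
    # to_numpy() replaces as_matrix(), which was removed in pandas 1.0
    out = df.to_numpy()
    return out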

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

Followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

Resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
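
For readers wondering why the numpy_fill function from the setup script works, its three steps can be traced on the example array. This is only an illustrative sketch; the intermediate names idx_valid, idx_last and rows are introduced here for readability and are not part of the original solution.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Step 1: keep the column index of every valid entry, use 0 for NaN entries.
idx_valid = np.where(~mask, np.arange(mask.shape[1]), 0)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# Step 2: a running maximum along each row turns this into
# "column index of the last valid entry seen so far".
idx_last = np.maximum.accumulate(idx_valid, axis=1)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Step 3: fancy indexing picks, for every cell, the value at that column.
rows = np.arange(arr.shape[0])[:, None]
filled = arr[rows, idx_last]
print(filled)
```

The whole fill is therefore one comparison, one cumulative maximum and one gather, with no Python-level loop over the elements.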

- Xukrao

4

What should happen if the first element in a row is nan? - Tadhg McDonald-Jensen

In that case pandas leaves the NaN unchanged. I assume the OP wants consistent behavior. - DYZ

4

Fill zero values of 1d numpy array with last non-zero values. - blacksite

1

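By the way, you don't even need to call as_matrix(): the original arr is modified in place. - DYZ

I am looking for a solution for 3-D arrays; for 2-D arrays, the dumbest approach is to convert to a df first and then use fillna. - Tommy Yu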
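
On the question raised in the comments about a row whose first element is nan: a small sketch, assuming the numpy_fill function from the question's setup script (repeated here so the snippet runs on its own), suggests that leading nan values are simply left in place, which matches the pandas behavior described above.

```python
import numpy as np

def numpy_fill(arr):
    """Row-wise forward fill, copied from the question's setup script."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

arr = np.array([[np.nan, 1.0, np.nan],
                [np.nan, np.nan, 2.0]])

filled = numpy_fill(arr)
# A leading NaN has no valid value to its left, so its index stays 0
# and it is "filled" with itself, i.e. it remains NaN.
print(filled)
```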

13 Answers

- Andrew · Answer 1

A minor improvement on RichieV's generalized pure-numpy solution, adding axis selection and "backward" support:

def _np_fill_(arr, axis=-1, fill_dir='f'):
    """Base function for np_fill, np_ffill, np_bfill."""
    if axis < 0:
        axis = len(arr.shape) + axis
    
    if fill_dir.lower() in ['b', 'backward']:
        dir_change = tuple([*[slice(None)]*axis, slice(None, None, -1)])
        return np_ffill(arr[dir_change], axis=axis)[dir_change]
    elif fill_dir.lower() not in ['f', 'forward']:
        raise KeyError(f"fill_dir must be one of: 'b', 'backward', 'f', 'forward'. Got: {fill_dir}")
    
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

def np_fill(arr, axis=-1, fill_dir='f'):
    """General fill function which supports multiple filling steps. I.e.: 
    fill_dir=['f', 'b'] or fill_dir=['b', 'f']"""
    if isinstance(fill_dir, (tuple, list, np.ndarray)):
        for i in fill_dir:
            arr = _np_fill_(arr, axis=axis, fill_dir=i)
    else:
        arr = _np_fill_(arr, axis=axis, fill_dir=fill_dir)
    return arr

def np_ffill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='forward')

def np_bfill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='backward')
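
The same idea can also be written more compactly with np.take_along_axis, which may be easier to follow for N-dimensional input such as the 3-D case asked about in the comments. This is an alternative sketch, not the code from the answer above; the names ffill_along_axis and bfill_along_axis are hypothetical.

```python
import numpy as np

def ffill_along_axis(arr, axis=-1):
    """Forward-fill NaNs along `axis` (hypothetical helper, not from the answer)."""
    arr = np.asarray(arr, dtype=float)
    n = arr.shape[axis]
    # Positions 0..n-1, reshaped so that they broadcast along `axis` only.
    shape = [1] * arr.ndim
    shape[axis] = n
    positions = np.arange(n).reshape(shape)
    # Position of each valid value (0 for NaNs), then carry the largest
    # position seen so far forward along the axis.
    idx = np.where(np.isnan(arr), 0, positions)
    idx = np.maximum.accumulate(idx, axis=axis)
    return np.take_along_axis(arr, idx, axis=axis)

def bfill_along_axis(arr, axis=-1):
    """Backward fill: flip, forward-fill, flip back."""
    flipped = np.flip(arr, axis=axis)
    return np.flip(ffill_along_axis(flipped, axis=axis), axis=axis)

cube = np.array([[[1.0, np.nan], [np.nan, 4.0]],
                 [[np.nan, 6.0], [7.0, np.nan]]])
print(ffill_along_axis(cube, axis=0))
print(bfill_along_axis(cube, axis=-1))
```

As with the other numpy solutions, NaNs that have no valid value before them along the chosen axis are left unchanged.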

- LearnToGrow · Answer 2

Unless I'm missing something, these solutions do not work on any example:

arr  = np.array([[ 3.],
 [ 8.],
 [np.nan],
 [ 7.],
 [np.nan],
 [ 1.],
 [np.nan],
 [ 3.],
 [ 8.],
 [ 8.]])
print("A:::: \n", arr)

print("numpy_fill::: \n ",  numpy_fill(arr))
print("loop_fill",  loops_fill(arr))

A:::: 
 [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]
numpy_fill::: 
  [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]
loop_fill [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]

Comments ??
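
One possible explanation, offered as a sketch rather than a definitive diagnosis: the array above has shape (10, 1), so every row holds a single element and a row-wise fill has nothing to the left to copy from. Assuming the intent is to fill down the column, transposing before and after the call to the question's numpy_fill (repeated here so the snippet is self-contained) appears to give the expected result.

```python
import numpy as np

def numpy_fill(arr):
    """Row-wise forward fill, copied from the question's setup script."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]

col = np.array([[3.], [8.], [np.nan], [7.], [np.nan],
                [1.], [np.nan], [3.], [8.], [8.]])
print(col.shape)  # (10, 1): ten rows with one element each

# Transpose to shape (1, 10), fill along the single row, transpose back.
filled_down = numpy_fill(col.T).T
print(filled_down.ravel())
```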

- Tan Phan · Answer 3

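I used the np.nan_to_num function, for example:

# the fill value must be passed via nan=; the second positional argument is copy
# np.nanmean ignores NaNs, whereas data.mean() would itself be NaN
data = np.nan_to_num(data, nan=np.nanmean(data))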

Reference: NumPy documentation
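
A short sketch of how this differs from a forward fill, using a made-up 1-D array: np.nan_to_num substitutes one constant for every NaN (here the NaN-ignoring mean, passed through the nan= keyword), whereas a forward fill copies the last valid neighbour. The variable names below are chosen for illustration only.

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0, np.nan])

# Constant fill: every NaN becomes the mean of the valid values (2.0 here).
mean_filled = np.nan_to_num(data, nan=np.nanmean(data))

# Forward fill in 1-D, for comparison: index of the last valid value so far.
idx = np.where(np.isnan(data), 0, np.arange(data.size))
idx = np.maximum.accumulate(idx)
forward_filled = data[idx]

print(mean_filled)
print(forward_filled)
```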