This gives the desired result:

masked_b = ma.masked_array(*np.broadcast_arrays(b, ext_mask))
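For context, here is a self-contained sketch of the whole approach, using small made-up shapes rather than the arrays from the original question:

```python
import numpy as np
from numpy import ma

# hypothetical shapes: 5 data values per cell, one mask flag per cell
b = np.random.randn(3, 4, 5)
mask = np.random.randn(3, 4) > 0
ext_mask = mask[..., np.newaxis]   # shape (3, 4, 1)

# broadcast_arrays stretches the mask to b's shape without copying data
masked_b = ma.masked_array(*np.broadcast_arrays(b, ext_mask))
print(masked_b.shape)       # (3, 4, 5)
print(masked_b.mask.shape)  # (3, 4, 5)
```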
I have not profiled this approach, but it should be faster than allocating a new mask. According to the documentation, no data is copied:

"These arrays are views on the original arrays. They are typically not contiguous. Furthermore, more than one element of a broadcasted array may refer to a single memory location. If you need to write to the arrays, make copies first."
We can verify that no copying takes place:

bb, mb = np.broadcast_arrays(b, ext_mask)
print(mb.shape)
print(mb.base.shape)
print(mb.strides)
It is quite amazing how the numpy developers implemented broadcasting: values are repeated by using a stride of 0 along the last dimension. Wow!
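The stride-0 trick can be seen directly with np.broadcast_to, which relies on the same mechanism (a small sketch of my own, not from the original answer):

```python
import numpy as np

row = np.arange(3, dtype=np.int64)    # shape (3,), strides (8,)
tiled = np.broadcast_to(row, (4, 3))  # virtual shape (4, 3), no copy made

# the new leading dimension has stride 0: every "row" of tiled
# points at the same three int64 values in memory
print(tiled.strides)                  # (0, 8)
print(np.shares_memory(tiled, row))   # True
```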
Edit

I compared the speed of broadcasting and allocating with the following code:
import numpy as np
from numpy import ma
a = np.random.randn(30, 90, 31, 2, 1)
b = np.random.randn(30, 90, 31, 2, 5)
mask = np.random.randn(30, 90, 31, 2) > 0
ext_mask = mask[..., np.newaxis]
def broadcasting(a=a, b=b, ext_mask=ext_mask):
    mb1 = ma.masked_array(*np.broadcast_arrays(b, ext_mask))

def allocating(a=a, b=b, ext_mask=ext_mask):
    m2 = np.empty(b.shape, dtype=bool)
    m2[:] = ext_mask
    mb2 = ma.masked_array(b, m2)
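As a sanity check (my addition, not part of the original benchmark), one can verify that both approaches produce identical masks:

```python
import numpy as np
from numpy import ma

b = np.random.randn(3, 9, 31, 2, 5)
mask = np.random.randn(3, 9, 31, 2) > 0
ext_mask = mask[..., np.newaxis]

# broadcasting: the mask is a stride-0 view, no allocation
mb1 = ma.masked_array(*np.broadcast_arrays(b, ext_mask))

# allocating: the mask is explicitly materialized in new memory
m2 = np.empty(b.shape, dtype=bool)
m2[:] = ext_mask
mb2 = ma.masked_array(b, m2)

print(np.array_equal(mb1.mask, mb2.mask))  # True
```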
Broadcasting is clearly faster than allocating:
# array size: (30, 90, 31, 2, 5)
In [23]: %timeit broadcasting()
The slowest run took 10.39 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39.4 µs per loop
In [24]: %timeit allocating()
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 982 µs per loop
Note that I had to increase the array size to make the speed difference apparent. With the original array dimensions, allocating was slightly faster than broadcasting:
# array size: (3, 9, 31, 2, 5)
In [28]: %timeit broadcasting()
The slowest run took 9.36 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39 µs per loop
In [29]: %timeit allocating()
The slowest run took 9.22 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.6 µs per loop
The runtime of the broadcasting solution does not seem to depend on the array size.
Comments:

- masked_b = ma.masked_array(*np.broadcast(b, ext_mask)), but I don't know why ma.masked_array does not broadcast automatically. Edit: maybe because, for efficiency, it only wants to store views of two equally sized arrays? – MB-F
- TypeError: __new__() takes at most 11 arguments (8371 given). – orange
- broadcast_arrays returns views into the original arrays, which means no allocation is performed. – MB-F