沿轴计算直方图

Question

沿轴计算直方图

pythonperformancenumpyscipyvectorization

7

有没有一种方法可以计算nD数组的一个轴上的许多直方图？我目前使用的方法是使用for循环迭代所有其他轴，并为每个结果为1D数组的轴计算numpy.histogram()：

import numpy
import itertools
data = numpy.random.rand(4, 5, 6)

# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]

# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]

out.shape  # (4, 5, 200001)

不用说，这非常慢，然而我找不到使用 numpy.histogram, numpy.histogram2d 或者 numpy.histogramdd 解决这个问题的方法。

- Nils Werner

1

请参考 https://dev59.com/DlkS5IYBdhLWcg3wVlXC，但我不确定这是否比循环更快。 - MB-F

1

它们的速度大致相同（但沿轴应用更易于阅读）。 - Nils Werner

请检查您是否可以使用 np.histogramdd。 - MaxU - stand with Ukraine

1

我不能这样做。它是为解决不同的问题而设计的。 - Nils Werner

2个回答

3

第一个解决方案提供了一个很好的短语，它使用了numpy中的sortedsearch，这需要进行排序和多次搜索。但是numpy在其源代码中有一条快速路径，实际上是用Python完成的，可以在数学上处理相等的bin边缘范围。该解决方案仅使用向量减法和乘法以及一些比较运算。

此解决方案将遵循numpy代码的搜索排序、类型输入和处理权重以及复杂数字。它基本上是第一个解决方案与numpy直方图快速路线、一些额外的类型和迭代细节等的结合。

_range = range
def hist_np_laxis(a, bins=10, range=None, weights=None):
    # Initialize empty histogram
    N = a.shape[-1]
    data2D = a.reshape(-1,N)
    limit = bins*data2D.shape[0]
    # gh-10322 means that type resolution rules are dependent on array
    # shapes. To avoid this causing problems, we pick a type now and stick
    # with it throughout.
    bin_type = np.result_type(range[0], range[1], a)
    if np.issubdtype(bin_type, np.integer):
        bin_type = np.result_type(bin_type, float)
    bin_edges = np.linspace(range[0],range[1],bins+1, endpoint=True, dtype=bin_type)
    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = np.dtype(np.intp)
    else:
        ntype = weights.dtype
    n = np.zeros(limit, ntype)
    # Pre-compute histogram scaling factor
    norm = bins / (range[1] - range[0])
    # We set a block size, as this allows us to iterate over chunks when
    # computing histograms, to minimize memory usage.
    BLOCK = 65536
    # We iterate over blocks here for two reasons: the first is that for
    # large arrays, it is actually faster (for example for a 10^8 array it
    # is 2x as fast) and it results in a memory footprint 3x lower in the
    # limit of large arrays.
    for i in _range(0, data2D.shape[0], BLOCK):
        tmp_a = data2D[i:i+BLOCK]
        block_size = tmp_a.shape[0]
        if weights is None:
            tmp_w = None
        else:
            tmp_w = weights[i:i + BLOCK]
        # Only include values in the right range
        keep = (tmp_a >= range[0])
        keep &= (tmp_a <= range[1])
        if not np.logical_and.reduce(np.logical_and.reduce(keep)):
            tmp_a = tmp_a[keep]
            if tmp_w is not None:
                tmp_w = tmp_w[keep]
        # This cast ensures no type promotions occur below, which gh-10322
        # make unpredictable. Getting it wrong leads to precision errors
        # like gh-8123.
        tmp_a = tmp_a.astype(bin_edges.dtype, copy=False)

        # Compute the bin indices, and for values that lie exactly on
        # last_edge we need to subtract one
        f_indices = (tmp_a - range[0]) * norm
        indices = f_indices.astype(np.intp)
        indices[indices == bins] -= 1

        # The index computation is not guaranteed to give exactly
        # consistent results within ~1 ULP of the bin edges.
        decrement = tmp_a < bin_edges[indices]
        indices[decrement] -= 1
        # The last bin includes the right edge. The other bins do not.
        increment = ((tmp_a >= bin_edges[indices + 1])
                     & (indices != bins - 1))
        indices[increment] += 1

        ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1)
        #indices = scaled_idx.reshape(-1)
        # We now compute the histogram using bincount
        if ntype.kind == 'c':
            n.real += np.bincount(indices, weights=tmp_w.real,
                                  minlength=limit)
            n.imag += np.bincount(indices, weights=tmp_w.imag,
                                  minlength=limit)
        else:
            n += np.bincount(indices, weights=tmp_w,
                             minlength=limit).astype(ntype)
    n.shape = a.shape[:-1] + (bins,)
    return n

data = np.random.randn(4, 5, 6)
out1 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
out2 = hist_np_laxis(data, bins=200001, range=(- 2.5, 2.5))
print(np.allclose(out1, out2))
True
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
92.1 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
55.1 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

虽然第一个解决方案在小例子和大例子中都更快：

data = np.random.randn(400, 500, 6)
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
264 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
71.6 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

这并不总是更快：

data = np.random.randn(400, 6, 500)

%timeit hist_np_laxis(data, bins=101, range=(- 2.5, 2.5))
71.5 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit hist_laxis(data, n_bins=101, range_limits=(- 2.5, 2.5))
76.9 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

然而，仅当最后一个轴很大时，numpy的变体才更快。而且增速非常微小。在我尝试的所有其他情况中，无论bin计数和前两个维度的大小如何，第一种解决方案都要快得多。唯一重要的行 ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1) 可能更容易优化，尽管我还没有找到更快的方法。

这也意味着O(n)的向量化操作的数量超过了排序和重复增量搜索的O(n log n)。

但是，实际使用情况将具有具有大量数据的最后一个轴和前面的轴很少的情况。因此，在现实中，第一种解决方案中的样本过于人为，无法满足所需的性能。

numpy仓库中已经注意到了直方图的轴添加问题: https://github.com/numpy/numpy/issues/13166。

一个xhistogram库也试图解决这个问题：https://xhistogram.readthedocs.io/en/latest/。

- Gregory Morse

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Divakar · Accepted Answer

这里提供了一种矢量化的方法，利用高效工具np.searchsorted和np.bincount。 searchsorted根据区间给出每个元素应放置的位置，bincount则为我们计数。

实现 -

def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0],R[1],n_bins+1)
    data2D = data.reshape(-1,N)
    idx = np.searchsorted(bins, data2D,'right')-1

    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx==-1) | (idx==n_bins)

    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to 
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins*np.arange(data2D.shape[0])[:,None] + idx

    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins*data2D.shape[0]
    scaled_idx[bad_mask] = limit

    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(),minlength=limit+1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts

运行时测试

原始方法 -

def org_app(data, n_bins, range_limits):
    R = range_limits
    m,n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]

    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out

时间和验证 -

In [2]: data = np.random.randn(4, 5, 6)
   ...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
   ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
   ...: print np.allclose(out1, out2)
   ...: 
True

In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop

In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop

由于在循环解决方案中，我们正在遍历前两个轴。因此，让我们增加它们的长度，这将展示向量化解决方案有多好 -

In [59]: data = np.random.randn(400, 500, 6)

In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5))
1 loops, best of 3: 9.59 s per loop

In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop

In [62]: 9590/44.2          # Speedup number
Out[62]: 216.9683257918552