numpy.sum和numpy.cumsum之间的性能差异是什么原因？

Question

numpy.sum和numpy.cumsum之间的性能差异是什么原因？

3

如果我的理解是正确的，无论是np.sum还是np.cumsum都需要O(n)的时间。然而，当我在同一矩阵上按顺序执行这两个操作（在不同的轴上），尽管结果如预期，但np.sum和np.cumsum的顺序似乎使整体性能有所不同。

如果我首先沿着列方向（axis=1）对每一行执行np.cumsum，然后对所有行（axis=0）执行 np.sum，需要更长的时间。

如果我首先沿着行方向（axis=0）执行np.sum，然后对一维数组执行np.cumsum，需要更短的时间。

我的假设是，由于np.cumsum产生的数据比np.sum多，因此在有更多np.cumsum操作时需要更多的数据分配/操作时间，所以需要更长的时间。

以下是我的测试代码和结果

import numpy as np
import time

b = np.zeros((1000, 1000))
for i in range(1000):
    b[i] = np.array(range(1000))

time_start = time.time()
for i in range(1000):
    c = np.cumsum(b, axis=1)
    d = np.sum(c, axis=0)
time_end = time.time()
print(f"np.sum(np.cumsum(...)) time: {time_end - time_start}")


time_start = time.time()
for i in range(1000):
    c = np.cumsum(np.sum(b, axis=0))
time_end = time.time()
print(f"np.cumsum(np.sum(...)) time: {time_end - time_start}")

结果是：

np.sum(np.cumsum(...)) time: 3.6612446308135986
np.cumsum(np.sum(...)) time: 0.38796162605285645

- Yunzhao Xing

在你的第一个循环中，变量 c 的形状为(1000, 1000)，和 b 相同，所以你正在计算一百万个累积和，然后计算长度为1000的向量的1000个总和。在第二个循环中，你先计算1000个总和，然后只计算1000个累积和。 - Andrew Eckart

2个回答

1

如果我的理解正确，np.sum和np.cumsum都需要O(n)的时间。

正确（假设输入相同）！

似乎np.sum和np.cumsum的顺序会使整体性能有很大差异。

在相同的输入上，np.cumsum比np.sum更昂贵，因为np.cumsum需要分配并写入一个输出数组（可能非常大）。此外，假设浮点运算是可结合的，np.sum的实现可以很容易地优化，而np.cumsum则很难优化。

我的假设是，np.cumsum在数据分配/操作方面需要更多的时间，因为它会产生比np.sum更多的数据，所以如果有更多的np.cumsum操作，它将需要更长的时间。

问题在于第一个实现要做的工作比第二个多得多。这主要是它较慢的原因，特别是由于内存读/写。事实上，np.cumsum 在第一个实现中生成一个大的二维数组，必须由 np.sum 计算。 编写和读取此临时数组是昂贵的。在示例中，第一个实现需要将 14.9 GiB 的数据从/到内存层次结构（可能在 RAM 中）移动，仅用于临时数组。请注意，构建临时数组也使计算不那么 高速缓存友好，因为它需要更多的空间，因此可能不再同时适合b 和 c（我的机器上每个占用8 MiB，我的 CPU 具有 9 MiB 的 L3 缓存，这意味着不能同时适合两者），而第二个实现则相反。RAM 的吞吐量通常比 CPU 缓存的吞吐量小得多。

请注意，第二个实现也会产生临时数组。这个版本应该像第一个版本一样分配尽可能多的临时数组。但是，第二个实现产生的临时数组要小1000倍！因此，在这个版本中调用np.cumsum速度更快。在我的机器上，b主要存储在快速L3缓存中，临时数组存储在非常快速的L1缓存中。

- Jérôme Richard

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- hpaulj · Accepted Answer

看看你在小数组中应用的备选方案。

In [45]: b = np.arange(24).reshape(4,6)
In [46]: b
Out[46]: 
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

sum(cumsum(b)):

In [47]: np.cumsum(b, axis=1)         # same shape as b
Out[47]: 
array([[  0,   1,   3,   6,  10,  15],
       [  6,  13,  21,  30,  40,  51],
       [ 12,  25,  39,  54,  70,  87],
       [ 18,  37,  57,  78, 100, 123]])
In [48]: np.sum(_, axis=0)
Out[48]: array([ 36,  76, 120, 168, 220, 276])

求前 n 个数的和：

In [49]: np.sum(b, axis=0)            # much reduced array
Out[49]: array([36, 40, 44, 48, 52, 56])
In [50]: np.cumsum(_)
Out[50]: array([ 36,  76, 120, 168, 220, 276])

第一步操作的是完整的数组 b，包含相同数量的元素，而第二步在对 sum 进行计算时使用了大大缩减的数组。

实际上，对于这个小样本，时间差异不大：

In [57]: timeit np.sum(np.cumsum(b,axis=1), axis=0)
17.8 µs ± 29.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [58]: timeit np.cumsum(np.sum(b, axis=0))
16.1 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

当数组更大时，由于减少而导致的时间节约变得显著：

In [59]: b = np.random.randint(0,1000,(1000,1000))
In [60]: timeit np.sum(np.cumsum(b,axis=1), axis=0)
5.11 ms ± 9.99 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [61]: timeit np.cumsum(np.sum(b, axis=0))
1.64 ms ± 649 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

仅看第一步：

In [62]: timeit np.cumsum(b,axis=1)
3.73 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [63]: timeit np.sum(b, axis=0)
1.62 ms ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

cumsum因为返回完整的b.shape数组，所以速度较慢。

O(n)告诉我们有关操作如何扩展的信息，例如在小数组和大数组上操作时。但它不能用于比较一个操作和另一个操作。此外，在numpy中，核心计算是在编译代码中快速完成的，但内存管理和其他设置问题可能会淹没核心时间。

编辑

np.sum本质上是np.add.reduce：

In [79]: timeit np.add.reduce(b,axis=0)
1.59 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [80]: timeit np.sum(b,axis=0)
1.64 ms ± 5.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.cumsum 是 np.add.accumulate 的简写：

In [81]: timeit np.cumsum(b,axis=0)
10.2 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [82]: timeit np.add.accumulate(b,axis=0)
10 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)