如何将循环遍历3D点数组的Python函数向量化?

4
这里是代码:
import numpy as np
from numpy.random import random

@profile
def point_func(point, points, funct):
    return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))

@profile
def point_afunc(ipoints, epoints, funct):
    res = np.zeros(len(ipoints))
    for idx, point in enumerate(ipoints):
        res[idx] = point_func(point, epoints, funct)
    return res

@profile
def main():
    points = random((5000,3))
    rpoint = random((1,3))

    pres = point_func(rpoint, points, lambda r : r**3)

    ares = point_afunc(points, points, lambda r : r**3)

if __name__=="__main__":
    main()

我使用对其进行了分析,得到了以下结果:

Timer unit: 1e-06 s

Total time: 2.25667 s File: point-array-vectorization.py Function: point_func at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def point_func(point, points, funct):
     6      5001    2256667.0    451.2    100.0      return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))

Total time: 2.27844 s File: point-array-vectorization.py Function: point_afunc at line 8

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def point_afunc(ipoints, epoints, funct):
    10         1          5.0      5.0      0.0      res = np.zeros(len(ipoints))
    11      5001       4650.0      0.9      0.2      for idx, point in enumerate(ipoints):
    12      5000    2273789.0    454.8     99.8          res[idx] = point_func(point, epoints, funct)
    13         1          0.0      0.0      0.0      return res

Total time: 2.28239 s File: point-array-vectorization.py Function: main at line 15

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    15                                           @profile
    16                                           def main():
    17         1        145.0    145.0      0.0      points = random((5000,3))
    18         1          2.0      2.0      0.0      rpoint = random((1,3))
    19                                           
    20         1        507.0    507.0      0.0      pres = point_func(rpoint, points, lambda r : r**3)
    21                                           
    22         1    2281731.0 2281731.0    100.0      ares = point_afunc(points, points, lambda r : r**3)

所以这部分占用了大部分时间:

11      5001       4650.0      0.9      0.2      for idx, point in enumerate(ipoints):
    12      5000    2273789.0    454.8     99.8          res[idx] = point_func(point, epoints, funct)

我想确定是否因为在 for 循环中调用 funct 导致时间损失。为了做到这一点,我想要使用 numpy.vectorizepoint_afunc 向量化。我已经尝试过了,但它似乎将点向量化掉了:循环最终遍历单个点组件。

@profile
def point_afunc(ipoints, epoints, funct):
    res = np.zeros(len(ipoints))
    for idx, point in enumerate(ipoints):
        res[idx] = point_func(point, epoints, funct)
    return res

point_afunc = np.vectorize(point_afunc)

导致错误:
  File "point-array-vectorization.py", line 24, in main
    ares = point_afunc(points, points, lambda r : r**3)
  File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2755, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2825, in _vectorize_call
    ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
  File "/usr/lib/python3.6/site-packages/numpy/lib/function_base.py", line 2785, in _get_ufunc_and_otypes
    outputs = func(*inputs)
  File "/usr/lib/python3.6/site-packages/line_profiler.py", line 115, in wrapper
    result = func(*args, **kwds)
  File "point-array-vectorization.py", line 10, in point_afunc
    res = np.zeros(len(ipoints))
TypeError: object of type 'numpy.float64' has no len()

不知何故,它没有将ipoints中的每个向量化,而是在点的组成部分之间进行了向量化?

编辑:尝试了@John Zwinck的建议并使用了numba。我发现使用@jit的执行时间比不使用更长。如果我从所有函数中删除@profile装饰器,并将其替换为point_funcpoint_afunc@jit,则执行时间如下:

time ./point_array_vectorization.py 

real    0m3.686s
user    0m3.584s
sys 0m0.077s
point-array-vectorization> time ./point_array_vectorization.py 

real    0m3.683s
user    0m3.596s
sys 0m0.063s
point-array-vectorization> time ./point_array_vectorization.py 

real    0m3.751s
user    0m3.658s
sys 0m0.070s

并且移除了所有的@jit装饰器:

point-array-vectorization> time ./point_array_vectorization.py 

real    0m2.925s
user    0m2.874s
sys 0m0.030s
point-array-vectorization> time ./point_array_vectorization.py 

real    0m2.950s
user    0m2.902s
sys 0m0.029s
point-array-vectorization> time ./point_array_vectorization.py 

real    0m2.951s
user    0m2.886s
sys 0m0.042s

我需要更多地帮助numba编译器吗?

编辑: 可以使用numpy在不使用for循环的情况下编写 point_afunc 吗?

编辑: 通过Peter的广播版本将循环版本进行比较,循环版本更快

Timer unit: 1e-06 s

Total time: 2.13361 s
File: point_array_vectorization.py
Function: point_func at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def point_func(point, points, funct):
     9      5001    2133615.0    426.6    100.0      return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))

Total time: 2.1528 s
File: point_array_vectorization.py
Function: point_afunc at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def point_afunc(ipoints, epoints, funct):
    13         1          5.0      5.0      0.0      res = np.zeros(len(ipoints))
    14      5001       4176.0      0.8      0.2      for idx, point in enumerate(ipoints):
    15      5000    2148617.0    429.7     99.8          res[idx] = point_func(point, epoints, funct)
    16         1          0.0      0.0      0.0      return res

Total time: 2.75093 s
File: point_array_vectorization.py
Function: new_point_afunc at line 18

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    18                                           @profile
    19                                           def new_point_afunc(ipoints, epoints, funct):
    20         1    2750926.0 2750926.0    100.0      return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)

Total time: 4.90756 s
File: point_array_vectorization.py
Function: main at line 22

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def main():
    24         1        170.0    170.0      0.0      points = random((5000,3))
    25         1          4.0      4.0      0.0      rpoint = random((1,3))
    26         1        546.0    546.0      0.0      pres = point_func(rpoint, points, lambda r : r**3)
    27         1    2155829.0 2155829.0     43.9      ares = point_afunc(points, points, lambda r : r**3)
    28         1    2750945.0 2750945.0     56.1      vares = new_point_afunc(points, points, lambda r : r**3)
    29         1         71.0     71.0      0.0      assert(np.max(np.abs(ares-vares)) < 1e-15)

1
你的意图是在每次调用时提供另一个函数(在你的情况下是 lambda r : r**3)吗?这将在 Numba 中产生显着的编译开销。 - max9111
@max9111:是的,这些函数将会不同,但我想要使用的函数集合是有限的。 - tmaric
1
将这个有限的函数集提前编译或使用缓存可能会很有用。这个函数集有多大? - max9111
@max9111:谢谢!大概有15个函数。 - tmaric
3个回答

3

numpy.vectorize() 没有任何实际的性能效益:它只是一种语法糖(或者更确切地说,是一种语法毒药),它构建了一个隐藏的Python for循环。 它对你没有帮助。

有一件事可能会对你非常有帮助,那就是使用Numba。它可以即时编译你的原始代码,并且很可能大大加快速度。只需将你的@profile装饰器替换为@numba.jit


numpy.vectorize难道不是让一个函数能够在循环中执行numpy.ndarray不需要调用解释器的吗? - tmaric
3
绝对不行。文档明确说明:“矢量化函数主要是为了方便而提供的,而不是为了性能。其实现本质上是一个for循环。” - John Zwinck
好的,这与numba向量化不同。谢谢! - tmaric

1
您可以使用广播技术来实现。广播是将点矩阵重新塑形,以便各个维度进行"广播"。例如:ipoints[:, None, :] - epoints[None, :, :] 表示:

  • ipoints 从 MxD 重塑为 Mx1xD
  • epoints 从 NxD 重塑为 1xNxD
  • 减去每对点以得到一个 MxNxD 数组

完整代码如下:

def new_point_afunc(ipoints, epoints, funct):
    return np.sum(funct(np.sqrt((ipoints[:, None, :] - epoints[None, :, :])**2).sum(axis=-1)), axis=1)

警告- 在您的示例中,点的维度仅为3,但对于更高的维度,这可能在内存方面不太实用,因为此ipoints[:, None, :] - epoints[None, :, :]方法会创建一个形状为(len(ipoints), len(epoints), n_dim)的中间矩阵。


我刚刚使用你的版本运行了 kernprof,似乎比带有循环的那个版本要慢?我已经编辑了问题。感谢你的帮助! - tmaric
1
这有点令人惊讶,但很合理 - 因为这种“完全矢量化”的方法需要为中间减法操作分配更多的内存。 - Peter

1

Numba示例

此方法的性能取决于有多少次需要调用所创建的函数以及输入数据的大小。由于编译开销约为1.67秒,因此在处理相对较小的数据或仅调用函数一次时,使用此方法并不太适合。

我还使用了您的代码进行了轻微修改。使用Numba编写简单循环而不是多个矢量化命令(如np.sum(funct(np.sqrt(((point - points)**2)).sum(1))))将更快,无论是运行时还是编译时间。此外,这个问题可以很容易地并行化,但这将进一步增加编译时间。

示例

import numpy as np
from numpy.random import random
import numba as nb
import time

def make_point_func(funct):
    @nb.njit(fastmath=True)
    def point_func(point, points):
        return np.sum(funct(np.sqrt(((point - points)**2)).sum(1)))
    return point_func

def make_point_afunc(funct,point_func):
    @nb.njit(fastmath=True)
    def point_afunc(ipoints, epoints):
        res = np.zeros(len(ipoints))
        for idx in range(len(ipoints)):
            res[idx] = point_func(ipoints[idx], epoints)
        return res
    return point_afunc


def main():
  points = random((5000,3))
  rpoint = random((1,3))

  #Make functions
  point_func=make_point_func(nb.njit(lambda r : r**3))
  point_afunc=make_point_afunc(nb.njit(lambda r : r**3),point_func)

  #first call
  print("First call with compilation overhead")
  t1=time.time()
  pres = point_func(rpoint, points)
  ares = point_afunc(points, points)
  print(time.time()-t1)

  print("Second call without compilation overhead")
  t1=time.time()
  #second call
  pres = point_func(rpoint, points)
  ares = point_afunc(points, points)
  print(time.time()-t1)

if __name__=="__main__":
  main()

Performance

original: 1.7s
Numba first call: 1.87s
Numba further calls: 0.2s

将函数包装成“make_”函数的原因是什么?我尝试直接将“njit”装饰器添加到函数中,但是“numba”无法编译它,因为它无法推断出“point_func”的第三个参数(无法确定<class'function'>的Numba类型)。这种策略是简单地避免推断导致问题的此类类型,并将其余部分包装起来,以便“numba”可以处理它吗? - tmaric
1
无法直接将函数传递给已编译的函数。https://numba.pydata.org/numba-doc/dev/user/faq.html - max9111
谢谢!我已经看了文档,但没有找到这个问题的解决方法。我是numba的新手,需要处理大量的信息。 - tmaric

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接