Extended example to understand CUDA, Numba, Cupy, etc

Question

python gpu numba cupy

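Most of the Numba, CuPy, etc. examples available online are simple array additions that show the speedup from going from a single CPU core/thread to a GPU, and the documentation for the individual commands mostly lacks good examples. This post is intended to provide a more comprehensive example.

The initial code is provided here. It is a simple model of the classic cellular automaton. Originally it doesn't even use numpy, just plain Python and the pyglet module for visualization.

My goal is to extend this code to a specific problem (one that will be very large), but first I thought it best to optimize it for GPU usage.

The game_of_life.py file is this: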

import random as rnd
import pyglet
#import numpy as np
#from numba import vectorize, cuda, jit

class GameOfLife: 
 
    def __init__(self, window_width, window_height, cell_size, percent_fill):
        self.grid_width = int(window_width / cell_size)    # number of columns
        self.grid_height = int(window_height / cell_size)  # number of rows
        self.cell_size = cell_size
        self.percent_fill = percent_fill
        self.cells = []
        self.generate_cells()
  
    def generate_cells(self):
        for row in range(0, self.grid_height): 
            self.cells.append([])
            for col in range(0, self.grid_width):
                if rnd.random() < self.percent_fill:
                    self.cells[row].append(1)
                else:
                    self.cells[row].append(0)
                
    def run_rules(self): 
        temp = []
        for row in range(0, self.grid_height):
            temp.append([])
            for col in range(0, self.grid_width):
                cell_sum = sum([self.get_cell_value(row - 1, col),
                                self.get_cell_value(row - 1, col - 1),
                                self.get_cell_value(row,     col - 1),
                                self.get_cell_value(row + 1, col - 1),
                                self.get_cell_value(row + 1, col),
                                self.get_cell_value(row + 1, col + 1),
                                self.get_cell_value(row,     col + 1),
                                self.get_cell_value(row - 1, col + 1)])
                
                if self.cells[row][col] == 0 and cell_sum == 3:
                    temp[row].append(1)
                elif self.cells[row][col] == 1 and (cell_sum == 3 or cell_sum == 2):
                    temp[row].append(1)
                else:                 
                    temp[row].append(0)
        
        self.cells = temp

    def get_cell_value(self, row, col): 
        if row >= 0 and row < self.grid_height and col >= 0 and col < self.grid_width:
            return self.cells[row][col]
        return 0

    def draw(self): 
        for row in range(0, self.grid_height):
            for col in range(0, self.grid_width):
                if self.cells[row][col] == 1:
                    #(0, 0) (0, 20) (20, 0) (20, 20)
                    square_coords = (row * self.cell_size,                  col * self.cell_size,
                                     row * self.cell_size,                  col * self.cell_size + self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size,
                                     row * self.cell_size + self.cell_size, col * self.cell_size + self.cell_size)
                    pyglet.graphics.draw_indexed(4, pyglet.gl.GL_TRIANGLES,
                                         [0, 1, 2, 1, 2, 3],
                                         ('v2i', square_coords))

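First, I could switch to numpy by adding self.cells = np.asarray(self.cells) at the end of generate_cells and self.cells = np.asarray(temp) at the end of run_rules, since converting any earlier wouldn't bring a speedup, as shown here. (In fact, changing to numpy gave no noticeable speedup.)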
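One likely reason the conversion alone changes little is that the nested Python loops in run_rules still visit every cell one at a time, so numpy only ends up storing the result. A minimal sketch of a whole-array update follows, assuming the grid is a 2-D array of 0s and 1s. The name `life_step` is hypothetical and not part of the original code. It keeps the original boundary rule, where cells outside the grid count as dead.

```python
import numpy as np

def life_step(cells: np.ndarray) -> np.ndarray:
    """Return the next generation of a 0/1 grid (hypothetical helper)."""
    h, w = cells.shape
    # A border of zeros plays the role of get_cell_value() returning 0
    # for out-of-range coordinates, so edges need no special-casing.
    padded = np.pad(cells, 1)
    # Add up the eight shifted views of the padded grid: one pass per
    # neighbour direction instead of one Python call per cell.
    neighbours = sum(
        padded[1 + dr : 1 + dr + h, 1 + dc : 1 + dc + w]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    born = (cells == 0) & (neighbours == 3)
    survives = (cells == 1) & ((neighbours == 2) | (neighbours == 3))
    return (born | survives).astype(cells.dtype)
```

With this shape, run_rules would reduce to `self.cells = life_step(self.cells)`. CuPy mirrors much of the numpy array API, so array-level code like this is usually the starting point for a GPU port, though that step is not tested here.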

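Regarding the GPU, I added @jit before every function, for example, and it became very slow. I also tried @vectorize(['float32(float32, float32)'], target='cuda'), but this raised a question: how do you use @vectorize on functions whose only input argument is self?

I also tried substituting cupy for numpy, as in self.cells = cupy.asarray(self.cells), but that became very slow as well.

Following the initial idea of an extended example of GPU usage, what would be the proper approach to this problem? Where is the right place to put the modifications/vectorization/parallelization/numba/cupy, etc.? And most importantly, why?

Additional info: besides the code provided above, here is the main.py file: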

import pyglet
from game_of_life import GameOfLife 
 
class Window(pyglet.window.Window):
 
    def __init__(self):
        super().__init__(800,800)
        self.gameOfLife = GameOfLife(self.get_size()[0],
                                     self.get_size()[1],
                                     15,  # cell size: the smaller this value, the more computation per frame
                                     0.5) 

        pyglet.clock.schedule_interval(self.update, 1.0/24.0) # 24 frames per second
 
    def on_draw(self):
        self.clear()
        self.gameOfLife.draw()
        
    def update(self, dt):
        self.gameOfLife.run_rules()
 
if __name__ == '__main__':
    window = Window()
    pyglet.app.run()

- rod_CAE

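My understanding of the cuda.jit decorator is very limited, but I think the main cause of poor kernel performance is transferring too much data between the CPU and the GPU. To avoid that, you should pass only the variables that are actually needed, especially when working with large arrays. By passing self as an argument to every function (i.e. kernel), you may be transferring unnecessary data. Also keep in mind that each thread operates on a single element of the array, so iterating over the array with a for loop will not be parallelized. Hope this helps. - boi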
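The two points in this comment (pass only the arrays, and treat each cell as independent work) can be sketched in plain Python. This is an illustration under assumptions, not a GPU implementation: `step_into` and `Life` are hypothetical names, and the code runs on the CPU. In a numba.cuda kernel the two loops would presumably be replaced by a thread index such as the one returned by `cuda.grid(2)`.

```python
import numpy as np

def step_into(src, dst):
    """Write the next generation of src into dst; only the two arrays are passed."""
    h, w = src.shape
    for row in range(h):
        for col in range(w):
            # Each (row, col) is independent of the others: this body is
            # the work a single GPU thread would do.
            r0, r1 = max(row - 1, 0), min(row + 2, h)
            c0, c1 = max(col - 1, 0), min(col + 2, w)
            # Sum the clipped 3x3 window, then drop the centre cell.
            total = int(src[r0:r1, c0:c1].sum()) - int(src[row, col])
            alive = total == 3 or (total == 2 and src[row, col] == 1)
            dst[row, col] = 1 if alive else 0

class Life:
    """Hypothetical thin wrapper: the state lives in two preallocated arrays."""

    def __init__(self, height, width, percent_fill, seed=0):
        rng = np.random.default_rng(seed)
        self.cells = (rng.random((height, width)) < percent_fill).astype(np.uint8)
        self.scratch = np.empty_like(self.cells)

    def run_rules(self):
        step_into(self.cells, self.scratch)
        # Swap the buffers instead of building a new grid every frame.
        self.cells, self.scratch = self.scratch, self.cells
```

Because `step_into` never sees `self`, a decorator would be applied to a free function with array arguments rather than to a method, and the two buffers could stay wherever they were allocated between frames.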

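@boi, thanks for pointing that out. I started learning Python three months ago, and this is the first time I've used classes. I've been programming for more than ten years but never used them, so this is new to me. Things like self, __init__, etc. are all new to me. I'll look more carefully at how to pass arguments properly. About the for loops, do you know whether Python has something similar to Matlab's parfor? - rod_CAE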

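Actually, numba.prange may be what you are looking for, although I don't think it is possible to parallelize loops that way in numba.cuda. Here is the documentation: https://numba.readthedocs.io/en/stable/user/parallel.html?highlight=prange. I'm fairly new to all of this :). - boi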
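As a rough illustration of the prange suggestion, here is a minimal sketch of the update rule with the row loop marked as parallel. It runs on multiple CPU cores, not on the GPU. The function name `life_step_parallel` is hypothetical, and the `try`/`except` fallback is an addition of this sketch so that it still runs (serially) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback assumed for this sketch only: run as ordinary Python.
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func

@njit(parallel=True)
def life_step_parallel(cells):
    h, w = cells.shape
    out = np.zeros_like(cells)
    # Rows do not depend on each other, so prange may split them
    # across CPU threads.
    for row in prange(h):
        for col in range(w):
            total = 0
            for dr in range(-1, 2):
                for dc in range(-1, 2):
                    if dr != 0 or dc != 0:
                        r = row + dr
                        c = col + dc
                        # Cells outside the grid count as dead.
                        if r >= 0 and r < h and c >= 0 and c < w:
                            total += cells[r, c]
            if total == 3 or (total == 2 and cells[row, col] == 1):
                out[row, col] = 1
    return out
```

The first call includes compilation time when Numba is present, so any timing comparison should be made on a later call.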

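@rod_CAE Any update on this? I can also relate to the class structure being new. - Sterling

Unfortunately not, @Sterling. The project went on for a while without parallelization, but it is currently on hold because my other projects are more important than this one. - rod_CAE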

1 Answer

- weidong · Accepted Answer

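I don't fully understand your example, but I needed GPU computation too. After a few painful days I think I understand how to use it, so I'll show you what I have in the hope that it helps. One thing to point out: in the call "...kernel(cuts, cuts" I pass cuts twice. The first copy is an ordinary typed argument, so the kernel uses it as the element it iterates over and it cannot be read by index. The second copy is declared raw, so I can read it at any index I compute.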

```
binsort_kernel = cp.ElementwiseKernel(
'int32 I,raw T cut,raw T ind,int32 row,int32 col,int32 q','raw T out,raw T bin,raw T num',    
'''
int i_x = i / col;                
int i_y = i % col;                
int b_f = i_x*col;                
int b_l = b_f+col;                
int n_x = i_x * q;                
int inx = i_x%row*col;            
////////////////////////////////////////////////////////////////////////////////////////
int r_x = 0; int adi = 0; int adb = 0;  
////////////////////////////////////////////////////////////////////////////////////////
if (i_y == 0)
{
for(size_t j=b_f; j<b_l; j++){
    if (cut[j]<q){                
        r_x = inx + j -b_f;       
        adb = n_x + cut[j];       
        adi = bin[adb] + num[adb];
        out[adi] = ind[r_x];      
        num[adb]+= 1;             
    }}
}
////////////////////////////////////////////////////////////////////////////////////////
''','binsort')

binsort_kernel(cuts,cuts,ind,row,col,q,iout,bins,bnum)