How to run a Python script on a discrete AMD GPU?

Question

4

My goal:

I have a script that lists the prime numbers within a given interval:

# Python program to display all the prime numbers within an interval

lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
   # all prime numbers are greater than 1
   if num > 1:
       for i in range(2, num):
           if (num % i) == 0:
               break
       else:
           print(num)

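I want to run this script on the GPU instead of the CPU so that it is faster.

The problem:

My Intel NUC NUC8i7HVK does not have an NVIDIA GPU; it has a "discrete GPU".

If I run the following code to check my GPU: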

import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
    __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print (a, res)

I get:

[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
[1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>

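Possible approaches:

I found a guide that walks through running it on a GPU step by step, in great detail. But all the libraries that run Python programs on the GPU, such as PyOpenGL, PyOpenCL, Tensorflow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.

If you have AMD, all the libraries require ROCm, but as far as I know that software still does not support integrated GPUs or discrete GPUs (see my answer below).

I found only one guide that covers this approach, but I could not get it to work.

Is there any hope, or am I trying to do something impossible?

EDIT: reply to @chapelo

If I choose 0, the reply is: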

Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

If I choose 1, the reply is:

Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

- Francesco Mantovani

1

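I had the same problem and basically had to go with Nvidia. - John Stud

@chapelo, thanks for your interest. Good question, I have posted the reply. I can't tell what it means. - Francesco Mantovani

It is telling you that it found 2 possible contexts: 0 is your AMD and 1 is your Intel. Which one do you prefer? Enter 0 or 1 and see what happens. If you want to use the AMD, set the environment variable PYOPENCL_CTX="0" to avoid being asked again. - chapelo

Thanks @chapelo, I chose 0. Now how do I tell Python to use the GPU when running that script? - Francesco Mantovani

Or you can define the context you want in your program: instead of using cl.create_some_context(), specify the context yourself, for example ctx = cl.Context(dev_type=cl.device_type.ALL, properties=[(cl.context_properties.PLATFORM, plat[0])]). - chapelo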

3 Answers

2

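pyopencl works with both your AMD and your Intel GPU, and you have already checked that your installation works. Just set the environment variable PYOPENCL_CTX='0' to use the AMD every time without being asked.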
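The variable can also be set from inside the script, as long as that happens before the context is created. This is a minimal sketch rather than part of the original answer: `create_context` is a hypothetical helper name, and the value "0" assumes the AMD platform is listed first, as in the question's output. If a platform exposes several devices, pyopencl may still ask which device to use; a colon-separated value such as "0:0" is the usual way to answer both prompts.

```python
import os

# Assumption: "0" is the AMD platform, as in the listing shown in the question.
os.environ["PYOPENCL_CTX"] = "0"


def create_context():
    """Create a context; with PYOPENCL_CTX set, no interactive prompt is expected."""
    # Imported here so that setting the variable above does not itself need pyopencl.
    import pyopencl as cl

    return cl.create_some_context()
```

Calling `create_context()` then replaces the bare `cl.create_some_context()` call in the test script.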

Alternatively, instead of using ctx = cl.create_some_context(), you can define the context in your program with the following code:

platforms = cl.get_platforms()
ctx = cl.Context(
   dev_type=cl.device_type.ALL,
   properties=[(cl.context_properties.PLATFORM, platforms[0])])
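Platform indices can differ from one machine to another, so the platform can also be picked by name. The sketch below is an illustration under assumptions, not the answer's own method: `find_platform_index` and `make_context` are hypothetical helper names, and the default keyword "AMD" assumes the platform name printed in the question. It also prints the devices of the chosen platform, which helps confirm which GPU the context actually uses.

```python
def find_platform_index(platform_names, keyword):
    """Return the index of the first platform whose name contains keyword."""
    keyword = keyword.lower()
    for index, name in enumerate(platform_names):
        if keyword in name.lower():
            return index
    raise LookupError(f"no OpenCL platform matching {keyword!r}")


def make_context(keyword="AMD"):
    """Create a context on the first platform whose name matches keyword."""
    import pyopencl as cl  # imported here so the helper above works without it

    platforms = cl.get_platforms()
    index = find_platform_index([p.name for p in platforms], keyword)
    devices = platforms[index].get_devices()
    for device in devices:
        # Show which devices end up in the context.
        print(index, platforms[index].name, "->", device.name)
    return cl.Context(devices=devices)
```

With the two platforms listed in the question, `make_context("Intel")` would be expected to select index 1.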

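Don't assume that your AMD is better than the Intel in every case. I have seen cases where the Intel outperformed the other processor. I think this has to do with the cost of copying data from the CPU to the other GPU.

That said, I think running your script in parallel will not bring much improvement compared with using a better algorithm:

- Use a sieve algorithm to get the primes up to the square root of the upper bound.
- Apply a similar sieve, using the primes from the previous step, to the numbers between your lower and upper bounds.

Perhaps this is not a good example of an algorithm that is easy to run in parallel, but you are all set to try another example.

However, to show you how to solve this problem using your GPU, consider the following changes:

The serial algorithm could look like this: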

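from math import sqrt

def primes_below(number):
    # Index 0 stands for 2; index a > 0 stands for the odd number 2*a + 1.
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            # Every composite below `number` has a factor no larger than limit.
            if num > limit: break
            # Stepping the index by num visits only the odd multiples of num.
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag: yield n(i)

def primes_between(lo, hi):
    # Intended for lo > sqrt(hi): a sieving prime inside [lo, hi] would cross itself out.
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    # Index a stands for the a-th odd number at or above lo.
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue  # skip 2: only odd numbers are stored
        # Find the first stored number that this prime divides.
        start = 0
        while (n(start)%prime) != 0:
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag: yield n(i)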

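This produces the list of primes between 1e6 and 5e6 in 0.64 seconds.

When I tried your script on my GPU, it did not finish within 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took about 29 seconds.

Modifying the script so that each work item on the GPU divides one odd number (there is no point in testing even numbers) by a precomputed list of primes, stopping once the prime is greater than the square root of the number itself, completes the same task in 0.50 seconds. That is an improvement!

Here is the code: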

from math import sqrt

import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):
    # Relies on primes_below() from the serial version above.
    primes = list(primes_below(int(sqrt(hi))+1))

    # Odd candidates from lo up to and including hi.
    numbers_h = np.arange(  lo + (0 if lo&1 else 1),
                            hi + 1,
                            2,
                            dtype=np.int32)

    # One work item per candidate.
    size = numbers_h.size

    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      /* The primes array ends with a 0 sentinel, which stops this loop. */
      for (; *primes; ++primes) {
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """

    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
       properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)

    # Skip 2 (all candidates are odd) and append the 0 sentinel for the kernel loop.
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:] + [0],
                                  dtype=np.int32))

    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)

    # Keep only the entries the kernel did not zero out.
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]

    yield from array.get()[:length.get()]
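The per-work-item logic can be checked on the CPU without any OpenCL device. The following is a rough emulation written for illustration, not the answer's code: `small_primes`, `is_prime_work_item` and `primes_between_emulated` are hypothetical names, and the stopping test uses the integer comparison `p * p > num` instead of the kernel's floating-point square root. Comparing its output with the GPU result for the same interval is one way to validate the kernel.

```python
from math import isqrt


def small_primes(limit):
    """Plain sieve of Eratosthenes: all primes <= limit."""
    flags = [True] * (limit + 1)
    flags[0:2] = [False, False]
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            for multiple in range(p * p, limit + 1, p):
                flags[multiple] = False
    return [n for n, is_p in enumerate(flags) if is_p]


def is_prime_work_item(num, odd_primes):
    """What a single work item does for one odd candidate."""
    if num < 3:
        return False
    for p in odd_primes:
        if p * p > num:  # past the square root: no divisor can follow
            break
        if num % p == 0:
            return False
    return True


def primes_between_emulated(lo, hi):
    """Primes in [lo, hi], testing odd candidates one by one."""
    odd_primes = small_primes(isqrt(hi))[1:]  # drop 2, candidates are odd
    first_odd = lo if lo % 2 else lo + 1
    result = [2] if lo <= 2 <= hi else []
    result += [n for n in range(first_odd, hi + 1, 2)
               if is_prime_work_item(n, odd_primes)]
    return result


print(primes_between_emulated(900, 1000))
```

On the GPU each call to `is_prime_work_item` would run as a separate work item; here they simply run in a list comprehension.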

- chapelo

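Check the changes I made; I posted an example at the end. Try running it. - chapelo

Your code looks promising, but neither script returns anything https://snipboard.io/M7N1Dy.jpg . Maybe that is because there is no print() in it? Also, how do I set hi and lo? - Francesco Mantovani

The functions I gave you are generators. You have to use them in your program, providing the relevant input values in each case, and the output will be used for something useful, I guess. Just printing the results accomplishes almost nothing. Call the functions from your program with values for hi and lo (for example, your lower and upper values), and use the results one by one or put them in a list. I thought you knew these simple things. - chapelo

My knowledge is certainly lower than yours; for example, this is the first time I have seen yield, and I did not know it existed before. I just want to print the results in the terminal. If I replace yield from with return, it says NameError: name 'primes_below' is not defined. How can I print only the numbers from 1 to 1000? I just want to check whether your solution works. Thanks. - Francesco Mantovani

I will post my reply as another answer, to keep things clear and concise. - chapelo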
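The generator point raised in these comments can be illustrated with a small self-contained sketch. It is an assumption-laden illustration, not code from the thread: `primes_naive` is a hypothetical name that wraps the question's own loop in a generator. A generator computes nothing until it is consumed, for example by `list()` or a `for` loop, which is why the functions appeared to return nothing. The NameError mentioned above is a separate issue: the GPU snippet calls `primes_below`, which is defined only in the serial snippet, so both must be in the same file.

```python
def primes_naive(lo, hi):
    """Generator form of the question's loop: yields primes instead of printing."""
    for num in range(max(lo, 2), hi + 1):
        for i in range(2, num):
            if num % i == 0:
                break
        else:
            yield num


# Collect all results into a list ...
print(list(primes_naive(1, 30)))

# ... or consume them one at a time.
for p in primes_naive(900, 920):
    print(p, end=" ")
print()
```

The same two patterns apply to any generator function, including the ones in the answer above.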

1

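The following code is a sample of a complete Python program, which usually includes:

The import statements
The function definitions
The main() function
The if __name__ == "__main__": section.

I hope this helps you solve your problem.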

import pyprimes
from math import sqrt
import numpy as np

import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_below(number):
    """Generate the prime numbers below a specified `number`"""
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit: break
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    """Generate the prime numbers between the `lo` and `hi` numbers"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue # avoid dividing by 2
        nlo = n(0)
        # slower # start = prime * (nlo//prime + 1) if nlo%prime else 0
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)

def primes_between_using_cl(lo, hi):
    """Generate the prime numbers between the lo and hi numbers;
    this is a parallel algorithm using pyopencl"""
    primes = list(primes_below(int(sqrt(hi))+1))
    # Odd candidates from lo up to and including hi.
    numbers_h = np.arange(  lo + (0 if lo&1 else 1),
                                  hi + 1,
                                  2,
                                  dtype=np.int32)
    # One work item per candidate.
    size = numbers_h.size
    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                        __global         int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      /* The primes array ends with a 0 sentinel, which stops this loop. */
      for (; *primes; ++primes) {
         if (*primes > max) break;
         if (num % *primes == 0) {
            numbers[gid] = 0;
            return;
         }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
        properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    # Skip 2 (all candidates are odd) and append the 0 sentinel for the kernel loop.
    primes_d = cl.array.to_device(queue, np.array(primes[1:] + [0], dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    # Keep only the entries the kernel did not zero out.
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]

def test(f, lo, hi):
    """Test that all prime numbers are generated by comparing with the
    output of the library `pyprimes`"""
    a = filter(lambda p: p>lo, pyprimes.primes_below(hi))
    b = f(lo, hi)
    result = True
    for p, q in zip (a, b):
        if p != q:
            print(p, q)
            result = False
    return result
    
def main():
    lower = 1000
    upper = 5000
    print("The prime numbers between {} and {}, are:".format(lower,upper))
    print()
    for p in primes_between_using_cl(lower, upper):
        print(p, end=' ')
    print()

if __name__ == '__main__':
    main()

- chapelo

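Thank you very much @chapelo. However, I still have many doubts: I can set platforms[0] or platforms[1], but the script always uses the Radeon and never the Intel. Also, the script only uses 10% of the GPU and never goes above that: https://snipboard.io/sNqzWk.jpg - Francesco Mantovani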

- Francesco Mantovani · Accepted Answer

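After extensive research and many attempts, I reached the following conclusions:

PyOpenGL: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm.
PyOpenCL: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm.
TensorFlow: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm.
PyTorch: works mainly with NVIDIA. If you have an AMD GPU, you need to install ROCm.

I installed ROCm, but if I run rocminfo it returns: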

ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

The clinfo command returns:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

rocm-smi returns:

Segmentation fault

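Because the official guide states that "The integrated GPUs of Ryzen are not officially supported targets for ROCm" and my GPU is integrated, it is not supported.

I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).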