How do I use the GPUs available with OpenMP?

7
I am trying to run code on GPUs using OpenMP, but I am not succeeding. In my code, I perform a matrix multiplication with for loops: once with an OpenMP pragma and once without, so that I can compare the execution times. After the first loop I call omp_get_num_devices() (this is my main test of whether I am actually connecting to a GPU). No matter what I try, omp_get_num_devices() always returns 0.

The computer I am using has two NVIDIA Tesla K40M GPUs. CUDA 7.0 and CUDA 7.5 are both available as modules, and the CUDA 7.5 module is normally active. gcc 4.9.3, 5.1.0, and 7.1.0 are all available as modules, with the gcc 7.1.0 module normally active. I compile my code with $ g++ -fopenmp -omptargets=nvptx64sm_35-nvidia-linux ParallelExperimenting.cpp -o ParallelExperimenting. I have successfully parallelized OpenMP code on the CPU, but not on the GPU.

My main goal is to get omp_get_num_devices() to return 2 as proof that I can detect and use the GPUs with OpenMP. Any help here would be greatly appreciated.
Here is the code I am using to check whether the GPUs are being used correctly:
#include <omp.h>
#include <fstream>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#include <iomanip>
#include <iostream>
using namespace std;

double A [501][501];
double B [501][501];
double C [501][501][501];
double D [501][501];
double E [501][501];
double F [501][501][501];
double dummyvar;
int Mapped [501];

int main() {
    int i, j, k, l, N, StallerGPU, StallerCPU;

    // matrix dimension
    N = 500;

    // Variables merely used to make the execution take longer and to
    //   exaggerate the difference in performance between the first and
    //   second calculation
    StallerGPU = 200;
    StallerCPU = 200;

    std::cout << " N = " << N << "\n";
    // generate matrix to be used in first calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                A[i][k] = i+1;
            } else {
                A[i][k] = i * k / N;
            }
        }
    }
    // generate other matrix to be used for the first calculation
    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            B[k][j] = 2*(N-1)-k-j;
        }
    }

//    Slightly adjusted matrices for second calculation
    for (i=0; i<N; i++) {
        for (k=0; k<N; k++) {
            if (i == k) {
                D[i][k] = i+2;
            } else {
                D[i][k] = i * k / N - 1;
            }
        }
    }

    for (k=0; k<N; k++) {
        for (j=0; j<N; j++) {
            E[k][j] = 2*(N+1)-k-j;
        }
    }

    dummyvar = 0;

    // Run the multiplication in parallel using GPUs

    double diff;
    time_t time1;
    time1 = time( NULL ); // CPU time counter
    cout << endl << " GPU section begins at " << ctime(&time1) << endl;

        //    This pragma is frequently changed to try different tags
        #pragma omp for collapse(4) private(i, j, k, l)

        for (i=0; i<N; i++) {
//            Mapped[i] = omp_is_initial_device();
            for (j=0; j<N; j++) {
                for (k=0; k<N; k++) {
                    for(l = 0; l < StallerGPU; l++ ) {
                        C[i][j][k] = A[i][k] * B[k][j] ;
                        dummyvar += A[i][k] * B[k][j] * (l + 1);
                    }
                }
//            cout << " i " << i << endl;
            }
        }


    // record the time it took to run the multiplication
    time_t time2 = time( NULL );
    cout << " number of devices: " << omp_get_num_devices() << endl;
    cout << " dummy variable: " << dummyvar << endl;

    float cpumin = difftime(time2,time1);
    diff = difftime(time2,time1);
    cout << " stopping at delta GPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time2) << endl;
    cout << " GPU time elasped " << diff << " s" << endl;
    cout << endl;

    dummyvar = 0;
    time_t time3 = time( NULL );
    cout << endl << " CPU section begins at " << ctime(&time3) << endl;
//    #pragma omp single
    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            for (k=0; k<N; k++) {
                for (int l=0; l<StallerCPU; l++) {
                    F[i][j][k] = D[i][k] * E[k][j];
                    dummyvar += D[i][k] * E[k][j] * (l - 1);
                }
            }
        }
    }
    // the sum to complete the matrix calculation is left out here, but would
    // only be used to check if the result of the calculation is correct

    time_t time4 = time( NULL );
    cpumin = difftime(time4,time3);
    diff = difftime(time4,time3);
    cout << " dummy variable: " << dummyvar << endl;
    cout << " stopping at delta CPU time: " << cpumin << endl; 
    cout << " terminating at " << ctime(&time4) << endl;
    cout << " CPU time elasped " << diff << " s" << endl;
    //Compare the time it took to confirm that we actually used GPUs to parallelize.
}

Here are the results of running the deviceQuery sample CUDA code:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            745 MHz (0.75 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : Yes
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
Result = PASS

Can you upload a minimal working example showing what you are trying to do? - Richard
Welcome to Stack Overflow! Unfortunately, your post does not provide a [mcve]. Please visit the help center and read the section on how to ask a good question. - Captain Obvlious
I also tried to get this working with clang, so that is probably where I got -omptargets= from. I can remove it, but removing it does not change my ability to connect to the GPU. - Josiah
Ah, now I see your problem. I was thinking in the wrong direction the whole time. So your main complaint is that omp_get_num_devices() never returns 1? Then why not use just that simple call to make your code truly minimal? If that works, you should be able to do the rest yourself. - BlameTheBits [a minimal sketch of such a check follows these comments]
1
I guess your GCC does not support offloading to GPUs. This feature actually requires 2 different GCCs to be installed (one targeting the host and one targeting the GPU). I am not sure whether such packages exist in your distribution; I prefer to build both from source as described here. - Ilya Verbin
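Following BlameTheBits's suggestion, a truly minimal test could look like the sketch below. (This is an illustrative program written for this rewrite, not code from the original thread; it assumes the same -fopenmp offloading flags as above.)

#include <stdio.h>
#include <omp.h>

int main() {
    // On a correctly configured system with two GPUs this should print 2;
    // 0 means the OpenMP runtime found no offload devices.
    printf("omp_get_num_devices() = %d\n", omp_get_num_devices());
    return 0;
}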
3 Answers

4

I may be wrong, but I think your posted code needs a few corrections (perhaps you are already aware of them). To actually run with OpenMP on a GPU target, you need to replace:

#pragma omp for collapse(4) private(i, j, k, l)

with:

#pragma omp target teams distribute parallel for collapse(4) private(i, j, k, l)

You can verify that the kernel is actually running on the GPU by profiling your executable with nvprof; it should show a kernel executing on the GPU. You can also change the number of teams and threads in the target region using the num_teams and thread_limit clauses, and you should see corresponding changes in your profile.
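For illustration, a target region with explicit team and thread counts might look like the following sketch (written for this rewrite; the array size and the num_teams/thread_limit values are arbitrary examples to vary while watching the nvprof output):

#include <stdio.h>

double x[1024];
double y[1024];

int main() {
    const int N = 1024;
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 0.0; }

    // Offload a simple loop, explicitly requesting 4 teams of at most
    // 128 threads each.
    #pragma omp target teams distribute parallel for \
            num_teams(4) thread_limit(128) map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++) {
        y[i] = 2.0 * x[i];
    }

    printf("y[10] = %f\n", y[10]);
    return 0;
}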
To programmatically check whether a target region is running on the target device, I use the omp_is_initial_device() call, which returns 0 when called from the accelerator. Here is an example:
int A[1] = {-1};
#pragma omp target
{
  // omp_is_initial_device() returns 0 when this region actually runs
  // on the accelerator, nonzero when it falls back to the host.
  A[0] = omp_is_initial_device();
}

if (!A[0]) {
  printf("Able to use offloading!\n");
}

I tried profiling it with nvprof as you suggested. After the program finished executing, I got the error ======== Warning: No CUDA application was profiled, exiting. And when I added omp_is_initial_device(), it returned 1 every time. - Josiah
That strongly suggests that your kernel is running on the CPU. As Ilya mentioned, you may need to compile gcc with GPU support. - Arpith Jacob
Why do you need a one-element array rather than a plain integer? I tried your code and it only works with the array, but I don't understand why. - Z boson
I figured it out. Scalar values are firstprivate by default, so the pointer is effectively what gets shared. #pragma omp target defaultmap(tofrom:scalar) lets scalars be shared. - Z boson
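To make that comment concrete, here is a hedged sketch (written for this rewrite) of the scalar variant: without the defaultmap clause the scalar would be firstprivate, and the device's write would be lost on return to the host.

#include <stdio.h>
#include <omp.h>

int main() {
    int on_host = -1;

    // Map the scalar tofrom explicitly; by default it would be
    // firstprivate, and the assignment below would not be copied back.
    #pragma omp target defaultmap(tofrom: scalar)
    {
        on_host = omp_is_initial_device();
    }

    if (!on_host)
        printf("Able to use offloading!\n");
    else
        printf("Target region ran on the host.\n");
    return 0;
}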

3
GCC 4.9.3 and 5.1.0 definitely do not support OpenMP offloading to GPUs. GCC 7.1.0, however, does support it, but it should be built with special configure options, as described here.
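As a rough outline of what such a build involves (the paths below are illustrative; the linked instructions are authoritative): you build an nvptx accelerator compiler first, then a host GCC that offloads to it, and compile with -foffload:

# Illustrative outline only; adapt paths and versions to your system.
# 1. Build and install nvptx-tools (assembler/linker for the nvptx target).
# 2. Configure and build GCC as the nvptx accelerator compiler:
../gcc-7.1.0/configure --target=nvptx-none \
    --enable-as-accelerator-for=x86_64-pc-linux-gnu
# 3. Configure and build the host GCC with offloading enabled:
../gcc-7.1.0/configure --enable-offload-targets=nvptx-none \
    --with-cuda-driver=/usr/local/cuda
# 4. Compile user code with GPU offloading:
g++ -fopenmp -foffload=nvptx-none ParallelExperimenting.cpp -o ParallelExperimenting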

1
Maybe I am headed in the wrong direction, but I want to help, because I once ran into a strange situation with a GPU.
You need to be added to the Linux "video" group in order to use the GPU.
Otherwise, all results returned from the GPU will be 0.
So I suggest you run a sample CUDA code to check whether you have fallen into the same situation I was once in.
It is strange, and I am not sure I have described it properly. Hope it helps.
According to this link: https://wiki.gentoo.org/wiki/NVidia/nvidia-drivers

Users who need to access the video card will need to be added to the video group.
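A hedged example of checking and fixing the group membership (the group name and commands may vary by distribution; log out and back in for the change to take effect):

# List the groups the current user belongs to.
groups
# Add the current user to the video group.
sudo usermod -aG video "$USER"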


Why don't you provide a link as a reference? This is a good answer, and it would read even better with a linked reference. - Thecave3
I will add the results of running the deviceQuery sample CUDA code. - Josiah
@MarkSetchell Thanks. I didn't know AMD GPUs need this too. - Clark Lee
