How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

Question

cuda gpu gpgpu nvidia kepler

13

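Can I use MPS to run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs? I'd like to do this because my applications cannot fully utilize the GPU on their own, so I want them to run together. Is there a code example that does this?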

- dalibocai

1 Answer

- Robert Crovella · Accepted Answer

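The necessary instructions are contained in the documentation for the MPS service. You'll notice that those instructions don't really depend on or call out MPI, so there is nothing MPI-specific about them.

Here's a walkthrough/example.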

Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or later for this. There were some configuration differences with prior versions of CUDA MPS that I won't cover here. Also, I'll demonstrate just using a single server/single GPU. The machine I am using for test is a CentOS 6.2 node using a K40c (cc3.5/Kepler) GPU, with CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it as id 2 in the order. All of these details matter in a system with multiple GPUs, impacting the scripts given below.
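Because the two enumeration orders can disagree, it may help to look up the nvidia-smi index by GPU name instead of hard-coding it. The following is a minimal bash sketch, not part of the original answer. `gpu_index_by_name` is a hypothetical helper name, and it assumes the comma-plus-space CSV layout that `nvidia-smi --query-gpu=index,name --format=csv,noheader` prints.

```bash
#!/bin/bash
# gpu_index_by_name NAME_FRAGMENT  (hypothetical helper)
# Reads "index, name" CSV lines on stdin and prints the index of every
# GPU whose name contains NAME_FRAGMENT.
gpu_index_by_name() {
  awk -F', ' -v frag="$1" 'index($2, frag) > 0 { print $1 }'
}

# Example:
#   nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_index_by_name K40c
```

On a machine like the one described above, this would be expected to print the index that the scripts below pass to `nvidia-smi -i`. On the CUDA side, setting the `CUDA_DEVICE_ORDER=PCI_BUS_ID` environment variable asks the runtime to enumerate devices by PCI bus ID, which usually brings it in line with the nvidia-smi ordering.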

I'll create several helper bash scripts and also a test application. For the test application, we'd like something with kernel(s) that can obviously run concurrently with kernels from other instances of the application, and we'd also like something that makes it obvious when those kernels (from separate apps/processes) are running concurrently or not. To meet these needs for demonstration purposes, let's have an app that has a kernel that just runs in a single thread on a single SM, and simply waits for a period of time (we'll use ~5 seconds) before exiting and printing a message. Here's a test app that does that:

```
$ cat t1034.cu
#include <stdio.h>
#include <stdlib.h>

#define MAX_DELAY 30

// Abort with a message if any preceding CUDA call or kernel launch failed.
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
        fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
            msg, cudaGetErrorString(__err), \
            __FILE__, __LINE__); \
        fprintf(stderr, "*** FAILED - ABORTING\n"); \
        exit(1); \
    } \
  } while (0)


#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// Host wall-clock time in microseconds, relative to start.
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define APPRX_CLKS_PER_SEC 1000000000ULL
// Spin on the SM clock for roughly the requested number of seconds.
__global__ void delay_kernel(unsigned seconds){

  unsigned long long dt = clock64();
  while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
}

int main(int argc, char *argv[]){

  unsigned delay_t = 5; // seconds, approximately
  unsigned delay_t_r = 0; // initialized so the range check is well-defined without an argument
  if (argc > 1) delay_t_r = atoi(argv[1]);
  if ((delay_t_r > 0) && (delay_t_r < MAX_DELAY)) delay_t = delay_t_r;
  unsigned long long difft = dtime_usec(0);
  delay_kernel<<<1,1>>>(delay_t);  // one thread on one SM
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  difft = dtime_usec(difft);
  printf("kernel duration: %fs\n", difft/(float)USECPSEC);
  return 0;
}

$ nvcc -arch=sm_35 -o t1034 t1034.cu
$ ./t1034
kernel duration: 6.528574s
$
```

We'll use a bash script to start the MPS server:

```
$ cat start_as_root.bash
#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d
$
```
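Before reading too much into a timing result, it can be worth confirming that the control daemon actually came up. The helper below is a small sketch that is not from the original answer. `mps_running` is a hypothetical name, and it assumes the procps `pgrep` utility is installed and that the MPS processes carry their usual `nvidia-cuda-mps-...` names.

```bash
#!/bin/bash
# mps_running  (hypothetical helper)
# Succeeds if some process name contains "nvidia-cuda-mps", which covers
# both the control daemon and an MPS server it may have spawned.
mps_running() {
  pgrep nvidia-cuda-mps > /dev/null
}

# Example:
#   if mps_running; then echo "MPS is up"; else echo "MPS is not running"; fi
```

This only checks that a process exists. It does not verify that a given application is actually connecting to the MPS server.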

And a bash script to launch 2 copies of our test app "simultaneously":
```
$ cat mps_run
#!/bin/bash
./t1034 &
./t1034
$
```
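To try more than two copies, the launch script can be generalized. This is a hedged sketch rather than something from the original answer. `run_copies` is a hypothetical helper that starts N background copies of any command and waits for all of them.

```bash
#!/bin/bash
# run_copies N CMD [ARGS...]  (hypothetical helper)
# Start N copies of CMD in the background, wait for all of them,
# and return non-zero if any copy exited with an error.
run_copies() {
  local n i pid pids rc
  n=$1
  shift
  pids=""
  rc=0
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" &                 # launch one copy without waiting for it
    pids="$pids $!"        # remember its process ID
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || rc=1    # collect every exit status
  done
  return "$rc"
}

# Example: run_copies 4 ./t1034
```

Unlike `mps_run`, this waits for every copy, so the shell prompt does not return while a slower instance is still running.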

We could also have a bash script to shut down the server, although it's not needed for this walkthrough:

```
$ cat stop_as_root.bash
#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT
$
```

Now when we launch our test app using the mps_run script above, but without actually enabling the MPS server, we get the expected behavior: one instance of the app takes the expected ~5 seconds, whereas the other instance takes approximately double that (~10 seconds). Since it does not run concurrently with an app from another process, it waits ~5 seconds while the other app's kernel is running, then spends ~5 seconds running its own kernel:
```
$ ./mps_run
kernel duration: 6.409399s
kernel duration: 12.078304s
$
```

On the other hand, if we start the MPS server first, and repeat the test:

```
$ su
Password:
# ./start_as_root.bash
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
All done.
# exit
exit
$ ./mps_run
kernel duration: 6.167079s
kernel duration: 6.263062s
$
```

we see that both apps take the same amount of time to run, because the kernels are running concurrently, due to MPS.
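One way to make the comparison less of an eyeball exercise is to summarize the printed durations. The filter below is an illustrative sketch, not part of the original answer. `summarize_durations` is a hypothetical name, and it assumes the `kernel duration: <t>s` output format of the test app above.

```bash
#!/bin/bash
# summarize_durations  (hypothetical helper)
# Reads "kernel duration: <t>s" lines on stdin and prints the count,
# the shortest and longest duration, and their ratio.
summarize_durations() {
  awk '/^kernel duration:/ {
         t = $3
         sub(/s$/, "", t)          # strip the trailing "s"
         t += 0                    # force numeric
         if (n == 0 || t < min) min = t
         if (n == 0 || t > max) max = t
         n++
       }
       END {
         if (n && min > 0)
           printf "n=%d min=%.2f max=%.2f ratio=%.2f\n", n, min, max, max / min
       }'
}

# Example: ./mps_run | summarize_durations
```

Fed the two runs shown above, this reports a ratio of about 1.88 without MPS and about 1.02 with MPS. A ratio near 2 for two copies suggests the kernels were serialized, and a ratio near 1 suggests they overlapped.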

You're welcome to experiment as you see fit. If this sequence appears to work correctly for you, but running your own application doesn't seem to give the expected results, one possible reason may be that your app/kernels are not able to run concurrently with other instances of the app/kernels due to the construction of your kernels, not anything to do with MPS. You might want to verify the requirements for concurrent kernels, and/or study the concurrentKernels sample app.

Much of the information here was recycled from the test/work done here, albeit the presentation here with separate apps is different than the MPI case presented there.

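UPDATE: The scheduler behavior in the non-MPS case, when running kernels from multiple processes, appears to have changed with Pascal and newer GPUs. The above test results are still correct for the GPUs tested on (e.g. Kepler), but when running the above test case on a Pascal or newer GPU, different results will be observed in the non-MPS case. The scheduler is described as a "time-sliced" scheduler in the latest MPS doc, and what appears to be happening is that the scheduler may, according to some unpublished rules, choose to pre-empt a running kernel so that it can switch to another kernel from another process. This still doesn't mean that kernels from separate processes are running "concurrently" in the traditional usage of that word in CUDA documentation, but the above code is "tricked" by the time-sliced scheduler (on Pascal and newer) because it depends on using the SM clock to set kernel duration. The combination of the time-sliced scheduler plus this usage of the SM clock makes this test case appear to run "concurrently". However, as described in the MPS doc, in the non-MPS case the code from kernel A is not executing in the same clock cycle(s) as the code from kernel B when A and B originate from separate processes.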

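An alternative way to demonstrate this using the above general approach might be to use a kernel duration that is set by a number of loop iterations, rather than a kernel duration that is set by reading the SM clock, as described here. Care must be taken in that case to avoid having the compiler "optimize out" the loops.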