如何在CUDA中为多个Thrust对象成员函数调用内核函数？

Question

如何在CUDA中为多个Thrust对象成员函数调用内核函数？

3

请注意，我对CUDA是一个绝对的新手，下面的所有内容都是未经测试的伪代码。我来自JavaScript，我的C ++也很生疏，所以对我的无知表示歉意 :)

我正在尝试使用CUDA来回测许多不同的外汇策略。

使用Thrust，我有1000个从类实例化的对象（伪代码）：

#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 1000

typedef struct dataPoint {
    ...
} dataPoint;

class Strategy {
    public:
        __device__ __host__ Strategy(...) {
            ...
        }

        __device__ __host__ void backtest(dataPoint data) {
            ...
        }
};

int main() {
    dataPoint data[100000];
    thrust::device_ptr<Strategy> strategies[1000];
    int i;

    // Instantiate 5000 strategies.
    for (i=0; i<1000; i++) {
        thrust::device_ptr<Strategy> strategies[i] = thrust::device_new<Strategy>(...);
    }

    // Iterate over all 100000 data points.
    for (i=0; i<100000; i++) {
        // Somehow run .backtest(data[j]) on each strategy here.
        // i.e. Run backtest() in parallel for all 1000
        // strategy objects here.
    }
}

现在假设我想在每个数据项的每个对象上运行.backtest()方法。按照程序顺序，我将执行以下操作：

// Iterate over all 100000 data points.
for (j=0; j<100000; j++) {
    // Iterate over all 1000 strategies.
    for (i=0; i<1000; i++) {
        strategies[i].backtest(data[j]);
    }
}

我该如何使用CUDA实现这个功能，以便在每个迭代j中并行运行所有策略的.backtest()？如果我必须完全重新设计一切，那就这样吧--我愿意接受任何必要的改变。如果类无法实现此功能，那么也没关系。

- Chad Johnson

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Robert Crovella · Accepted Answer

典型的推力代码经常使用某些C++惯用语（例如函数对象），因此如果您的C++有点生疏，您可能需要阅读关于C++函数对象的内容。您还可以查看thrust快速入门指南，其中讨论了函数对象以及我们即将使用的高级迭代器。

总的来说，从表达式的角度来看，我认为thrust非常适合您的问题描述。鉴于thrust表达式在这些类型的问题中的灵活性，可能有很多方法来解决这个问题。我将尝试呈现与您的伪代码“接近”的内容。但是，无疑有许多实现方式。

首先，在thrust中，我们通常避免使用for循环。这些将非常慢，因为它们通常涉及每次迭代的主机代码和设备代码交互（例如，在内部调用CUDA内核）。如果可能的话，我们更喜欢使用thrust算法，因为这些通常会“转换”为一个或少量的CUDA内核，在内部运行。

Thrust中最基本的算法之一是transform。它有多种变体，但基本上是将输入数据逐个元素地应用任意操作。

使用基本的Thrust转换操作，我们可以初始化您的数据以及策略，而不必使用for循环。我们将为每种对象（dataPoint，Strategy）构造一个适当长度的设备向量，然后使用thrust :: transform来初始化每个向量。

这使我们面临着执行每个dataPoint对每个Strategy的任务。理想情况下，我们不仅想在每个for循环迭代中执行此操作，而且想要并行执行每个Strategy与每个dataPoint，所有这些操作都在“同时”进行（即在单个算法调用中）。

实际上，我们可以将一个矩阵看作是一个轴由 dataPoint 组成（在您的示例中为 100000 维），另一个轴由 Strategy 组成（在您的示例中为 1000 维）。对于此矩阵中的每个点，我们设想它保存了该 Strategy 对该 dataPoint 的应用结果。

在 thrust 中，我们通常更喜欢将这样的二维概念实现为一维。因此，我们的结果空间等于 dataPoint 数量乘以 Strategy 数量的积。我们将创建一个大小为此的 result device_vector（在您的示例中为 100000*1000）来保存结果。

为了演示起见，由于您没有提供关于要执行的算术类型的指导，我们将假定以下内容：

the result of applying a Strategy against a dataPoint is a float.
the dataPoint is a struct consisting of an int (dtype - ignored for this example) and a float (dval). dval will contain, for dataPoint(i), 1.0f + i*10.0f.
the Strategy consists of a multiplier and an adder, to be used as follows:
```
Strategy(i) = multiplier(i) * dval + adder(i);
```
applying a Strategy against a dataPoint consists of retrieving the dval associated with the dataPoint, and substituting it into the equation given by item 3 above. This equation is captured in the backtest method of the class Strategy. The backtest method takes an object of type dataPoint as its argument, from which it will retrieve the appropriate dval.

还有一些概念需要介绍。2D结果矩阵的1D实现将要求我们提供适当的索引方式，以便在2D矩阵中的每个点上，给定其线性尺寸，我们可以确定将使用哪个策略和哪个数据点来计算该点处的结果。在Thrust中，我们可以使用精巧迭代器的组合来实现此目的。

简单来说，从“内而外”开始，我们将使用一个转换迭代器，该迭代器采用索引映射函数和由thrust:: counting_iterator提供的线性序列，以便为每个索引（每个矩阵维度）创建一个映射。每个映射函数中的算术运算将把result的线性索引转换为适当的重复索引，以便于矩阵的行和列。给定这个转换迭代器来创建重复的行或列索引，我们将该索引传递给排列迭代器，该迭代器选择每个指示行/列的适当的dataPoint或Strategy。然后，这两个项目（dataPoint，Strategy）在zip_iterator中一起压缩。然后将zip_iterator传递给run_strat函数对象，该函数对象实际上计算应用于给定dataPoint的给定Strategy。

以下是概述上述概念的示例代码：

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <math.h>

#define TOL 0.00001f


// number of strategies
#define N 1000
// number of data points
#define DSIZE 100000

// could use int instead of size_t here, for these problem dimensions
typedef size_t idx_t;


struct dataPoint {

  int dtype;
  float dval;
};

class Strategy {

    float multiplier;
    float adder;
    idx_t id;

    public:

        __device__ __host__ Strategy(){
          id = 0;
          multiplier = 0.0f;
          adder = 0.0f;
          }
        __device__ __host__ Strategy(idx_t _id) {
          id = _id;
          multiplier = 1.0f + ((float)id)/(float)N;
          adder = (float)id;
        }

        __device__ __host__ float backtest(dataPoint data) {
          return multiplier*data.dval+adder;
        }
};

// functor to initialize dataPoint
struct data_init
{
  __host__ __device__
  dataPoint operator()(idx_t id){
    dataPoint temp;
    temp.dtype = id;
    temp.dval = 1.0f + id * 10.0f;
    return temp;
    }
};

// functor to initialize Strategy
struct strat_init
{
  __host__ __device__
  Strategy operator()(idx_t id){
    Strategy temp(id);
    return temp;
    }
};

// functor to "test" a Strategy against a dataPoint, using backtest method
struct run_strat
{
  template <typename T>
  __host__ __device__
  float operator()(idx_t id, T t){
    return (thrust::get<0>(t)).backtest(thrust::get<1>(t));
    }
};

// mapping functor to generate "row" (Strategy) index from linear index
struct strat_mapper : public thrust::unary_function<idx_t, idx_t>
{
  __host__ __device__
  idx_t operator()(idx_t id){
    return id/DSIZE;
    }
};

// mapping functor to generate "column" (dataPoint) index from linear index
struct data_mapper : public thrust::unary_function<idx_t, idx_t>
{
  __host__ __device__
  idx_t operator()(idx_t id){
    return id%DSIZE;
    }
};



int main() {
    // initialize data
    thrust::device_vector<dataPoint> data(DSIZE);
    thrust::transform(thrust::counting_iterator<idx_t>(0), thrust::counting_iterator<idx_t>(DSIZE), data.begin(), data_init());

    // initialize strategies
    thrust::device_vector<Strategy> strategies(N);
    thrust::transform(thrust::counting_iterator<idx_t>(0), thrust::counting_iterator<idx_t>(N), strategies.begin(), strat_init());

    // test each data point against each strategy

        // Somehow run .backtest(data[j]) on each strategy here.
        // i.e. Run backtest() in parallel for all 1000
        // strategy objects here.

    // allocate space for results for each datapoint against each strategy
    thrust::device_vector<float> result(DSIZE*N);
    thrust::transform(thrust::counting_iterator<idx_t>(0), thrust::counting_iterator<idx_t>(DSIZE*N), thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(strategies.begin(), thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), strat_mapper())), thrust::make_permutation_iterator(data.begin(), thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), data_mapper())))), result.begin(), run_strat());

    // validation
    // this would have to be changed if you change initialization of dataPoint
    // or Strategy
    thrust::host_vector<float> h_result = result;
    for (int j = 0; j < N; j++){
      float m =  1.0f + (float)j/(float)N;
      float a = j;
      for (int i = 0; i < DSIZE; i++){
        float d =  1.0f + i*10.0f;
        if (fabsf(h_result[j*DSIZE+i] - (m*d+a))/(m*d+a) > TOL) {std::cout << "mismatch at: " << i << "," << j << " was: " << h_result[j*DSIZE+i] << " should be: " << m*d+a << std::endl; return 1;}}}
    return 0;
}

注意事项：

如上所述，这是一种可能的实现。我认为它应该是“相当”高效的，但在Thrust中可能会有更高效的实现。在尝试优化之前，可能需要对您实际的策略和回测方法进行更全面的分析。
最终的transform操作使用counting_iterator作为第一个参数（以及第二个参数），但这实际上被忽略了，并且是“虚拟”的用法，只是为了适当地调整问题的规模。它可以通过更简单的实现来消除，但在我看来，最简单的方法（而不会进一步混淆代码）是使用C++11的auto来定义zip_iterator，然后仅传递它本身以及它的偏移版本给thrust::transform，使用只接受一个输入向量的版本。我认为这不应该对性能产生太大影响，而且我觉得这样稍微容易理解一些，但也许不是。