Parallel multiplication of many small matrices by a fixed vector

Question

4

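The situation is as follows: I have several thousand elements, each given by a small matrix, for example 4x2 or 9x3. Within a given run, all matrices have the same dimensions.

I want to multiply each of these matrices by a fixed vector of precomputed values. In short: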

for(i = 1...n)
    X[i] = M[i] . N;

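How can I do this in parallel using Thrust? How should I lay out the data in memory?

Note: there may be specialized libraries better suited to doing this on GPUs. I chose Thrust because it lets me deploy to different backends, not just CUDA.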

- user2425792

2 Answers

-1

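If you are looking for a lean software library specialized in small matrix multiplications, have a look at https://github.com/hfp/libxsmm. The code below requests a specialized matrix kernel according to the typical GEMM parameters (note that some limitations apply).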

double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;

xmm = libxsmm_dmmdispatch(m, n, k,
  &m/*lda*/, &k/*ldb*/, &m/*ldc*/,
  &alpha, &beta, &flags, &prefetch);

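Given the code above, "xmm" can be run over a whole series of (small) matrices without requiring a particular data structure (the code below also uses "prefetch locations").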

if (0 < n) { /* check that n is at least 1 */
  #pragma omp parallel for private(i)
  for (i = 0; i < (n - 1); ++i) {
    const double *const ai = a + i * asize;
    const double *const bi = b + i * bsize;
    double *const ci = c + i * csize;
    xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
  }
  xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
  /* pseudo prefetch for last element of batch (avoids page fault) */
      a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}

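Besides the manual loop control shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures already exist that describe the series of operands (A, B, and C matrices).

The library achieves its performance for two reasons: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while the current product is being computed.

(If you are looking for something that blends well with C/C++, the library supports that. However, it does not target CUDA/Thrust.)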

- hfp

- Robert Crovella · Accepted Answer

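One possible approach:

1. Flatten your arrays (matrices) into a single data vector. This is an advantageous step anyway for enabling general Thrust processing.
2. Use a strided range mechanism to take your scaling vector and extend it to the overall length of the flattened data vector.
3. Use thrust::transform with thrust::multiplies to multiply the two vectors together.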

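If you later need to access individual matrices in the flattened data vector (or result vector), you can do so with pointer arithmetic or with a combination of fancy iterators.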
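The pointer arithmetic mentioned above can be sketched on the host as follows. This is a minimal illustration, assuming the matrices are stored back to back in row-major order; `MatrixView` and `matrix_at` are hypothetical names introduced here, not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a read-only view of one matrix inside a flattened batch.
// Assumes row-major storage within each matrix.
struct MatrixView {
  const int* data;   // points at the first element of this matrix
  std::size_t rows;
  std::size_t cols;

  int at(std::size_t r, std::size_t c) const {
    assert(r < rows && c < cols);
    return data[r * cols + c];
  }
};

// Returns a view of matrix `m` in a batch stored back to back in `flat`.
// The offset of matrix m is simply m * rows * cols.
inline MatrixView matrix_at(const std::vector<int>& flat, std::size_t m,
                            std::size_t rows, std::size_t cols) {
  assert((m + 1) * rows * cols <= flat.size());
  return MatrixView{flat.data() + m * rows * cols, rows, cols};
}
```

On the device side the same offset applies: matrix m of a thrust::device_vector starts at `begin() + m * rows * cols`.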

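If you need to reuse the extended scaling vector, you may want to follow step 2 exactly as outlined (i.e. use that method to create an actual vector holding the scaling vector repeated once per matrix, N times in total). If you only need to do this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (taking the index modulo the matrix length in elements), followed by a permutation iterator that indexes into the original scaling vector (whose length is that of one matrix).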
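The two options can be compared in a host-only sketch using the standard library rather than Thrust. The function names `tile_scale`, `multiply_tiled`, and `multiply_on_the_fly` are hypothetical and exist only for illustration. The first pair materializes the repeated scaling vector (the reuse case). The last function indexes the short scaling vector modulo its length, which is the mapping the counting/transform/permutation iterator chain expresses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Reuse case: build an actual vector holding `scale` repeated n_mat times.
inline std::vector<int> tile_scale(const std::vector<int>& scale,
                                   std::size_t n_mat) {
  std::vector<int> tiled;
  tiled.reserve(scale.size() * n_mat);
  for (std::size_t m = 0; m < n_mat; ++m)
    tiled.insert(tiled.end(), scale.begin(), scale.end());
  return tiled;
}

// Element-wise product of the flattened data with the materialized vector.
inline std::vector<int> multiply_tiled(const std::vector<int>& data,
                                       const std::vector<int>& tiled) {
  assert(data.size() == tiled.size());
  std::vector<int> out(data.size());
  std::transform(data.begin(), data.end(), tiled.begin(), out.begin(),
                 std::multiplies<int>());
  return out;
}

// One-off case: no extra buffer; element i is scaled by scale[i % scale.size()].
inline std::vector<int> multiply_on_the_fly(const std::vector<int>& data,
                                            const std::vector<int>& scale) {
  assert(!scale.empty() && data.size() % scale.size() == 0);
  std::vector<int> out(data.size());
  for (std::size_t i = 0; i < data.size(); ++i)
    out[i] = data[i] * scale[i % scale.size()];
  return out;
}
```

Both paths produce the same result. The materialized version costs one extra buffer the size of the whole batch, which pays off only if that buffer is reused.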

The following example implements the above without using the strided range iterator method:

#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024

struct my_modulo_functor : public thrust::unary_function<int, int>
{
  __host__ __device__
  int operator() (int idx) {
    return idx%(H_MAT*W_MAT);}
};

int main(){

  thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
  thrust::host_vector<int> scale(H_MAT*W_MAT);
  // synthetic; instead flatten/copy matrices into data vector
  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
  for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;

  thrust::device_vector<int> d_data = data;
  thrust::device_vector<int> d_scale = scale;
  thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);

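  // Multiply each data element by scale[i % (H_MAT*W_MAT)]; the scale vector
  // is indexed on the fly instead of being expanded to full length.
  thrust::transform(
      d_data.begin(), d_data.end(),
      thrust::make_permutation_iterator(
          d_scale.begin(),
          thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                          my_modulo_functor())),
      d_result.begin(), thrust::multiplies<int>());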

  thrust::host_vector<int> result = d_result;

  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) {
    int expected = data[i] * scale[i%(H_MAT*W_MAT)];
    if (result[i] != expected) {
      std::cout << "Mismatch at: " << i << " cpu result: " << expected
                << " gpu result: " << result[i] << std::endl;
      return 1;
    }
  }
  std::cout << "Success!" << std::endl;
  return 0;
}

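Edit: in response to a question below:

The benefit of using fancy iterators, e.g. transform(numbers, iterator), over assembling other numbers (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers), is that fancy iterators often eliminate extra data copies and/or data movement. If you only intend to use other numbers once, fancy iterators are usually the better choice. If you will use other numbers again, you may want to assemble it explicitly. This presentation is instructive, particularly the "Fusion" section.

For a one-time use of other numbers, the overhead of assembling it on the fly with fancy iterators and a functor is generally lower than explicitly creating a new vector and then passing that vector to the transform routine.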