使用CUDA进行行列式计算

Question

使用CUDA进行行列式计算

cudagpulinear-algebragpgpunvidia

6

是否有任何库或自由可用的代码可以完全在GPU上计算一个小型（6x6），双精度矩阵的行列式？

- Paul Caheny

4

为什么要在GPU上计算小矩阵的行列式？ - Bart

实时应用在GPU上，需要每秒计算数百次行列式。行列式计算的输入和输出在GPU上产生和消耗。 - Paul Caheny

如果你有大量的矩阵、单个巨大的矩阵或者计算作为更大整体的一部分，那么这是可以接受的。但是，如果你想要一个单独的内核来处理一个6x6的矩阵，我认为几乎没有什么可获得的。 - Bart

计算是整个系统的一部分，正如我所解释的那样，它的输入是在GPU上生成的，输出也在GPU上使用，每秒钟可以进行数百次。 - Paul Caheny

NVIDIA为注册开发者提供了批量Ax=b求解和小矩阵批量矩阵求逆的示例代码，但该代码不包括行列式计算。您可以尝试将矩阵求逆框架适应于行列式计算，但需要自己编写代码。 - njuffa

@PaulCaheny 可能需要编辑您的问题以表明有数百个小矩阵。我认为人们对您进行了负面评价，因为解决6x6行列式是一个非常糟糕的想法。 - Pavan Yalamanchili

3个回答

4

这是计划，你需要缓冲100个这些微小矩阵，并启动一次内核以一次性计算它们的行列式。

我不会写实际的代码，但这应该有帮助。

1）启动#块=#矩阵。每个块计算每个矩阵的行列式。

2) det（A）=det（A11*A22-A21*A12）;其中A为6x6，A11、A12、A21、A22为A的3x3子矩阵。

3) 编写一个设备函数，用于进行3x3矩阵的矩阵乘法

4) 计算3x3矩阵的行列式很简单：使用此处的公式。

编辑：显然（2）只在A21 * A12 == A12 * A21的情况下有效

另一种选择是：

1）对每个6x6矩阵进行高斯消元LU分解 2）将U的对角线元素相乘即可得到行列式。

- Pavan Yalamanchili

2

请注意，第二步只在某些情况下有效，例如如果A21A22=A22A21。请参见[wikipedia]（http://en.wikipedia.org/wiki/Determinant#Block_matrices） - Peter de Rivaz

-2

你可以使用OpenCL或CUDA作为库，并编写一个简短的程序（在OpenCL中的内核），以在GPU上计算行列式。

CUDA http://www.nvidia.de/object/cuda_home_new_de.html

OpenCL http://www.khronos.org/opencl/

http://www.csd.uwo.ca/~moreno/Publications/DetHpcsPaper-proceedings.pdf

这篇论文应该包含CUDA的伪代码。

- glethien

2

我知道我可以这样做，这就是为什么我问是否有任何库或自由可用的代码，这样我就不必自己写了！ - Paul Caheny

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Vitality · Accepted Answer

正如Bart在上面的评论中特别指出的那样，即使对于许多小矩阵的行列式计算，使用GPU也不能保证比其他计算平台更高效。我认为计算矩阵行列式的问题本身是一个有趣的问题，可能会在应用程序中多次出现。目前，我不知道任何提供CUDA行列式计算例程的库（包括cuBLAS和cuSOLVER都没有这样的例程），因此您有两种选择：一是实现自己的方法，如Pavan所指出的；二是设法使用其他可用的例程。关于最后一点，一种可能性是使用Cholesky分解，然后将行列式计算为Cholesky矩阵对角线元素乘积的平方。在Matlab中，它将读取：

prod(diag(chol(A)))^2

下面提供了一段利用这个思想的代码。特别地，使用 cuSOLVER 的 potrf 函数执行 Cholesky 分解，而 Cholesky 矩阵对角线上元素的乘积是 Thrust 中带步长的约简的应用。

以下代码适用于大矩阵，因此对于需要计算大矩阵行列式的人们非常有用。但是如何将其适应多个小矩阵？一种可能性是使用 cuSOLVER 的流进行 Cholesky 分解，然后使用 Thurst 1.8 的新动态并行特性。请注意，从 CUDA 7.0 开始，cuSOLVER 不允许使用动态并行。

以下是代码：

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<ostream>

#include <cusolverDn.h>
#include <cublas_v2.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>

#include <thrust/fill.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#include <thrust/copy.h>

/*************************/
/* STRIDED RANGE FUNCTOR */
/*************************/
template <typename Iterator>
class strided_range
{
    public:

    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct stride_functor : public thrust::unary_function<difference_type,difference_type>
    {
        difference_type stride;

        stride_functor(difference_type stride)
            : stride(stride) {}

        __host__ __device__
        difference_type operator()(const difference_type& i) const
        {
            return stride * i;
        }
    };

    typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef typename thrust::transform_iterator<stride_functor, CountingIterator> TransformIterator;
    typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

    // type of the strided_range iterator
    typedef PermutationIterator iterator;

    // construct strided_range for the range [first,last)
    strided_range(Iterator first, Iterator last, difference_type stride)
        : first(first), last(last), stride(stride) {}

    iterator begin(void) const
    {
        return PermutationIterator(first, TransformIterator(CountingIterator(0), stride_functor(stride)));
    }

    iterator end(void) const
    {
        return begin() + ((last - first) + (stride - 1)) / stride;
    }

    protected:
    Iterator first;
    Iterator last;
    difference_type stride;
};

int main(void)
{
    const int Nrows = 5;
    const int Ncols = 5;

    const int STRIDE = Nrows + 1;

    // --- Setting the host, Nrows x Ncols matrix
    double h_A[Nrows][Ncols] = { 
        { 2.,    -2.,    -2.,    -2.,    -2.,},  
        {-2.,     4.,     0.,     0.,     0.,}, 
        {-2.,     0.,     6.,     2.,     2.,}, 
        {-2.,     0.,     2.,     8.,     4.,}, 
        {-2.,     0.,     2.,     4.,     10.,}
    };

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolverDnCreate(&solver_handle);

    // --- CUDA CHOLESKY initialization
    cusolveSafeCall(cusolverDnDpotrf_bufferSize(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, &work_size));

    // --- CUDA POTRF execution
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));
    cusolveSafeCall(cusolverDnDpotrf(solver_handle, CUBLAS_FILL_MODE_LOWER, Nrows, d_A, Nrows, work, work_size, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful potrf execution\n\n";

    cusolverDnDestroy(solver_handle);

    // --- Strided reduction of the elements of d_A: calculating the product of the diagonal of the Cholesky factorization  
    thrust::device_ptr<double> dev_ptr = thrust::device_pointer_cast(d_A);
    typedef thrust::device_vector<double>::iterator Iterator;
    strided_range<Iterator> pos(dev_ptr, dev_ptr + Nrows * Ncols, STRIDE);

    double det = thrust::reduce(pos.begin(), pos.end(), 1., thrust::multiplies<double>());
    det  = det * det;

    printf("determinant = %f\n", det);

    return 0;
}