Returning the min and max elements of an array in CUDA

Question

3

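I'm doing some array manipulation/calculation in CUDA (via the Cudafy.NET library, though I'm equally interested in CUDA/C++ approaches) and need to calculate the minimum and maximum values in an array. One of the kernels looks like this: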

    [Cudafy]
    public static void UpdateEz(GThread thread, float time, float ca, float cb, float[,] hx, float[,] hy, float[,] ez)
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                + cb * (hx[i, j] - hx[i - 1, j])
                + cb * (hy[i, j - 1] - hy[i, j])
                ;
    }

I'd like to be able to do something like this:

    [Cudafy]
    public static void UpdateEz(GThread thread, float time, float ca, float cb, float[,] hx, float[,] hy, float[,] ez, out float min, out float max)
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        min = float.MaxValue;
        max = float.MinValue;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        {
            ez[i, j] =
                ca * ez[i, j]
                + cb * (hx[i, j] - hx[i - 1, j])
                + cb * (hy[i, j - 1] - hy[i, j])
                ;

            min = Math.Min(ez[i, j], min);
            max = Math.Max(ez[i, j], max);

        }
    }

Is there a convenient way to return the minimum and maximum values of the array (for the whole array, not just per thread or per block)?

- 3Dave

1

Traditionally, min and max are found with a reduction. I'm not familiar with Cudafy, but that doesn't look like a reduction. - alrikai

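@alrikai I'm happy to completely rework and optimize my code to solve this. I've looked into map/reduce and similar approaches, but the implementations are a bit hard to follow. Forget the Cudafy part: how would you handle this directly in CUDA/C++? - 3Dave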

1

You can use thrust or npp. - sgarizvi

1

If you decide to use thrust, you can refer to this example. - BenC

A typical parallel reduction should work. For example, the code I wrote for this question can easily be adapted. - Robert Crovella

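@alrikai This isn't a simple array, it's an electromagnetic wave simulator. I was trying to compute the min and max while updating the array (since I'm already touching every element), but there doesn't seem to be a way to do that short of a separate kernel that performs the reduction. - 3Dave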

3 Answers

1

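Based on your comments on the question, you're trying to compute the max and min while performing the update; that's possible, but it isn't the most efficient approach. If you insist on doing it that way, you can do atomic comparisons against a global minimum and a global maximum, with the downside that every thread gets serialized, which will likely be a significant bottleneck.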
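A minimal sketch of that atomic approach, for illustration only. CUDA has no built-in `atomicMin`/`atomicMax` overload for `float`, so the helpers below emulate them with an `atomicCAS` loop on the bit pattern. The names `atomicMinFloat`, `atomicMaxFloat`, `minmax_atomic_kernel`, and `minmax_atomic`, and the block size of 256, are assumptions made for this example, not part of the original code. Error checking is omitted and NaN inputs are ignored.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical helper: atomically raise *addr to value if value is larger.
__device__ void atomicMaxFloat(float* addr, float value)
{
    int* addr_as_int = reinterpret_cast<int*>(addr);
    int old = *addr_as_int;
    while (__int_as_float(old) < value) {
        const int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;   // our value was stored
    }
}

// Hypothetical helper: atomically lower *addr to value if value is smaller.
__device__ void atomicMinFloat(float* addr, float value)
{
    int* addr_as_int = reinterpret_cast<int*>(addr);
    int old = *addr_as_int;
    while (__int_as_float(old) > value) {
        const int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;   // our value was stored
    }
}

// Every thread compares its element against the two global results.
__global__ void minmax_atomic_kernel(const float* data, int n, float* gmin, float* gmax)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        atomicMinFloat(gmin, data[idx]);
        atomicMaxFloat(gmax, data[idx]);
    }
}

// Host wrapper: d_data is a device pointer holding n > 0 floats.
void minmax_atomic(const float* d_data, int n, float* h_min, float* h_max)
{
    // Start from the identity values: +FLT_MAX for min, -FLT_MAX for max.
    const float init[2] = { FLT_MAX, -FLT_MAX };
    float* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));
    cudaMemcpy(d_out, init, 2 * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;                       // assumed block size
    const int blocks = (n + threads - 1) / threads;
    minmax_atomic_kernel<<<blocks, threads>>>(d_data, n, d_out, d_out + 1);

    float result[2];
    cudaMemcpy(result, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    *h_min = result[0];
    *h_max = result[1];
}
```

In a fused version, the two atomic calls would go right after the `ez[i, j]` update in the kernel. Note that the max must be initialized to the most negative float (`-FLT_MAX` in C/C++), not `FLT_MIN`, which is the smallest positive value.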

For the more conventional approach of finding the max or min of an array via a reduction, you could do something like the following:

#define MAX_NEG ... //a very negative number, lower than any value in the input

template <typename T, int BLKSZ> __global__
void cu_max_reduce(const T* d_data, const int d_len, T* max_val)
{
    volatile __shared__ T smem[BLKSZ];

    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    //starting index for each block to begin loading the input data into shared memory
    const int bid_sidx = bid*BLKSZ;

    //load the input data to smem, with padding if needed. each thread handles 2 elements
    #pragma unroll
    for (int i = 0; i < 2; i++)
    {
        //get the index for the thread to load into shared memory
        const int tid_idx = 2*tid + i;
        const int ld_idx = bid_sidx + tid_idx;
        if(ld_idx < (bid+1)*BLKSZ && ld_idx < d_len)
            smem[tid_idx] = d_data[ld_idx];
        else
            smem[tid_idx] = MAX_NEG;

        __syncthreads();
    }

    //run the reduction per-block
    for (unsigned int stride = BLKSZ/2; stride > 0; stride >>= 1)
    {
        if(tid < stride)
        {
            smem[tid] = ((smem[tid] > smem[tid + stride]) ? smem[tid]:smem[tid + stride]);
        }
        __syncthreads();
    }

    //write the per-block result out from shared memory to global memory
    max_val[bid] = smem[0];
}


//assume we have d_data as a device pointer with our data, of length data_len
template <typename T> __host__
T cu_find_max(const T* d_data, const int data_len)
{
    //in your host code, invoke the kernel with something along the lines of:
    const int thread_per_block = 16; 
    const int elem_per_thread = 2;
    const int BLKSZ = elem_per_thread*thread_per_block; //number of elements to process per block
    //integer ceiling division; avoids float rounding error for large data_len
    const int blocks_per_grid = (data_len + BLKSZ - 1) / BLKSZ;

    dim3 block_dim(thread_per_block, 1, 1);
    dim3 grid_dim(blocks_per_grid, 1, 1);

    T *d_max;
    cudaMalloc((void **)&d_max, sizeof(T)*blocks_per_grid); 

    cu_max_reduce <T, BLKSZ> <<<grid_dim, block_dim>>> (d_data, data_len, d_max);

    //etc....
}

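This finds the per-block max. You can then run it again on its own output (e.g. with d_max as the input data and updated launch parameters) on 1 block to find the global max. Running it in this multi-pass fashion is necessary if the dataset is too large (in this case, more than 2 * 4096 elements, since each thread processes 2 elements, although you could raise that number by having each thread process more elements).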
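The multi-pass idea can be sketched as a host loop that keeps reducing the per-block results until a single value is left. This is a self-contained illustration rather than a drop-in extension of the kernel above: the names `block_max` and `find_max_multipass`, the `float`-only types, and the block size of 256 are assumptions for this example, and error checking is omitted.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

constexpr int THREADS = 256;   // assumed block size, must be a power of two

// One pass: each block reduces up to 2*THREADS inputs to a single maximum.
__global__ void block_max(const float* in, int n, float* out)
{
    __shared__ float smem[THREADS];
    const int tid = threadIdx.x;
    const int i = blockIdx.x * (2 * THREADS) + tid;

    // Each thread loads two elements; out-of-range slots are padded with -FLT_MAX.
    const float a = (i < n) ? in[i] : -FLT_MAX;
    const float b = (i + THREADS < n) ? in[i + THREADS] : -FLT_MAX;
    smem[tid] = fmaxf(a, b);
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            smem[tid] = fmaxf(smem[tid], smem[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];
}

// Hypothetical host driver: d_data is a device pointer holding n > 0 floats.
float find_max_multipass(const float* d_data, int n)
{
    const int per_block = 2 * THREADS;
    int blocks = (n + per_block - 1) / per_block;

    // Two scratch buffers, used alternately as input and output of each pass.
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, blocks * sizeof(float));
    cudaMalloc(&d_b, ((blocks + per_block - 1) / per_block) * sizeof(float));

    block_max<<<blocks, THREADS>>>(d_data, n, d_a);   // first pass over the data
    int remaining = blocks;
    float* src = d_a;
    float* dst = d_b;
    while (remaining > 1) {                           // later passes over partial results
        blocks = (remaining + per_block - 1) / per_block;
        block_max<<<blocks, THREADS>>>(src, remaining, dst);
        remaining = blocks;
        float* tmp = src; src = dst; dst = tmp;
    }

    float result;
    cudaMemcpy(&result, src, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

With these settings each pass shrinks the problem by a factor of 512, so even a million-element array needs only three launches. A minimum can be found the same way by swapping `fmaxf` for `fminf` and padding with `FLT_MAX`.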

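I should point out that this isn't particularly efficient (you'd want a smarter stride when loading shared memory to avoid bank conflicts), and I'm not 100% sure it's correct (it worked on the few small test cases I tried), but I tried to write it for maximum clarity. Also, don't forget to put in some error-checking code to make sure your CUDA calls complete successfully; I left it out here to keep things short.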

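I should also point you towards some more in-depth documentation. You can take a look at the CUDA sample on reduction (http://docs.nvidia.com/cuda/cuda-samples/index.html); although it isn't doing a min/max calculation, it's the same general idea (and more efficient). Also, if you're looking for simplicity, you might just want to use Thrust's functions thrust::max_element and thrust::min_element; the documentation is at thrust.github.com/doc/group__extrema.html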

- alrikai

1

You can develop your own min/max algorithm using a divide-and-conquer approach.

If using npp is an option, this function may be useful: nppsMinMax_32f.

- simon.denel

- Vitality · Accepted Answer

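If you're writing an electromagnetic wave simulator and don't want to reinvent the wheel, you can use thrust::minmax_element. Below is a simple example showing how to use it. Please add your own CUDA error checking.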

#include <stdio.h>

#include <cuda_runtime_api.h>

#include <thrust/pair.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

int main()
{
    const int N = 5;

    const float h_a[N] = { 3., 21., -2., 4., 5. };

    float *d_a;     cudaMalloc(&d_a, N * sizeof(float));
    cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

    float minel, maxel;
    thrust::pair<thrust::device_ptr<float>, thrust::device_ptr<float>> tuple;
    tuple = thrust::minmax_element(thrust::device_pointer_cast(d_a), thrust::device_pointer_cast(d_a) + N);
    minel = tuple.first[0];
    maxel = tuple.second[0];

    printf("minelement %f - maxelement %f\n", minel, maxel);

    cudaFree(d_a);  // release the device buffer

    return 0;
}