Using std::vector in CUDA device code

Question

54

The question is: is there a way to use the class "vector" in a CUDA kernel? When I try, I get the following error:

error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed

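Is there a way to use a vector in a __global__ function? I recently tried the following steps:

Create a new CUDA project
Go to the project's properties
Open CUDA C/C++
Go to Device
Change the value of "Code Generation" to: compute_20,sm_20

After that, I was able to use the printf standard library function in my CUDA kernel.

Is there a way to use the standard library class vector in kernel code, the same way printf is supported there? Here is an example of using printf in kernel code: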

// This code counts the 3s in an array using CUDA.
// private_count is an array that holds each thread's result separately.

__global__ void countKernel(int *a, int length, int* private_count)
{
    printf("%d\n", threadIdx.x);  // prints the thread id; this works

    // vector<int> y;
    // y.push_back(0); is it possible to do this?

    // Each thread scans its own contiguous chunk of `length` elements.
    unsigned int offset = threadIdx.x * length;
    for (unsigned int i = offset; i < offset + length; i++)
    {
        if (a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}

- Yasser

6

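+1 A perfectly reasonable question (not sure why it was downvoted). Unfortunately, the answer for now is no. - harrism

Sorry to bother. Just wondering whether there is an answer by now. - Aroli Marcellinus

5 Answers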

14

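The CUDA library Thrust lets you define vectors on the device with thrust::device_vector<T>, and transferring data between a host STL vector and a device_vector is straightforward. See this useful link for some examples.

Note, however, that device_vector itself cannot be used in device code; only pointers or iterators to its data can.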
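
As a rough illustration of that restriction, the sketch below keeps the device_vector on the host side and hands the kernel only a raw pointer obtained with thrust::raw_pointer_cast. The kernel markThrees and the helper countThreesWithThrust are hypothetical names made up for this example, and error checking is left out.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical kernel: each thread flags whether its element equals 3.
__global__ void markThrees(const int* data, int n, int* flags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        flags[i] = (data[i] == 3) ? 1 : 0;
    }
}

// Hypothetical host helper: counts the 3s in a host std::vector on the GPU.
int countThreesWithThrust(const std::vector<int>& host)
{
    if (host.empty()) {
        return 0;
    }
    // Constructing a device_vector from host iterators copies host -> device.
    thrust::device_vector<int> dData(host.begin(), host.end());
    thrust::device_vector<int> dFlags(host.size(), 0);

    int n = static_cast<int>(host.size());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // The kernel only ever sees raw pointers, never the device_vector itself.
    markThrees<<<blocks, threads>>>(thrust::raw_pointer_cast(dData.data()), n,
                                    thrust::raw_pointer_cast(dFlags.data()));

    // Sum the flags on the device and return the total to the host.
    return thrust::reduce(dFlags.begin(), dFlags.end(), 0);
}
```

Calling countThreesWithThrust on a host vector such as {3, 1, 3, 7, 3} should return the number of 3s it contains. For a plain count, Thrust's own algorithms would probably be enough without a custom kernel; the kernel is here only to show the pointer hand-off.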

- Zhong Hu

7

You cannot use std::vector in device code; you should use arrays instead.
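
One way to read "use arrays" for the kernel in the question is a fixed-capacity local array plus a counter, which covers the push_back use case as long as an upper bound is known. The sketch below assumes such a bound (kMaxHits, chosen arbitrarily); collectThrees and runCollectThrees are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kMaxHits = 8;  // assumed upper bound on recorded matches per thread

// Each thread scans its own chunk of `length` elements and records where it saw a 3.
__global__ void collectThrees(const int* a, int length, int* hitCounts, int* firstHits)
{
    int hits[kMaxHits];  // fixed-capacity stand-in for vector<int>
    int count = 0;       // number of valid entries in hits

    int offset = threadIdx.x * length;
    for (int i = offset; i < offset + length; ++i) {
        if (a[i] == 3 && count < kMaxHits) {
            hits[count++] = i;  // the "push_back"; matches beyond kMaxHits are dropped
        }
    }
    hitCounts[threadIdx.x] = count;
    firstHits[threadIdx.x] = (count > 0) ? hits[0] : -1;
}

// Hypothetical host helper: one block of `threads` threads.
// Returns the per-thread counts followed by the per-thread first-hit indices.
std::vector<int> runCollectThrees(const std::vector<int>& host, int threads)
{
    int length = static_cast<int>(host.size()) / threads;
    std::vector<int> out(2 * threads, 0);

    int* dA = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dA, host.size() * sizeof(int));
    cudaMalloc(&dOut, out.size() * sizeof(int));
    cudaMemcpy(dA, host.data(), host.size() * sizeof(int), cudaMemcpyHostToDevice);

    collectThrees<<<1, threads>>>(dA, length, dOut, dOut + threads);

    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), dOut, out.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);
    return out;
}
```

The trade-off is that the capacity is fixed at compile time, so a thread that finds more matches than kMaxHits silently stops recording them.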

- yyfn

6

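I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code, including the new/delete operators. Here is an extremely simple prototype of a device vector in CUDA; it works, but it has not been tested thoroughly.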

// Minimal per-thread growable array for device code (lightly tested prototype).
// Assumes T is trivially copyable, because expand() relocates elements with memcpy.
template<typename T>
class LocalVector
{
private:
    T* m_begin;
    T* m_end;

    size_t capacity;
    size_t length;

    // Double the capacity and move the existing elements into the new storage.
    __device__ void expand() {
        capacity *= 2;
        T* tempBegin = new T[capacity];

        memcpy(tempBegin, m_begin, length * sizeof(T));
        delete[] m_begin;
        m_begin = tempBegin;
        m_end = m_begin + length;
    }
public:
    // Initializers are listed in declaration order (capacity, then length).
    __device__ explicit LocalVector() : capacity(16), length(0) {
        m_begin = new T[capacity];
        m_end = m_begin;
    }
    // Copying is disabled so that two objects never delete the same buffer.
    LocalVector(const LocalVector&) = delete;
    LocalVector& operator=(const LocalVector&) = delete;

    __device__ T& operator[] (unsigned int index) {
        return *(m_begin + index);
    }
    __device__ T* begin() {
        return m_begin;
    }
    __device__ T* end() {
        return m_end;
    }
    __device__ ~LocalVector()
    {
        delete[] m_begin;
        m_begin = nullptr;
    }

    // Append a copy of t, growing the storage first if it is full.
    __device__ void add(T t) {
        if (length >= capacity) {
            expand();
        }
        *m_end = t;
        m_end++;
        length++;
    }
    // Remove and return the last element; the vector must not be empty.
    __device__ T pop() {
        m_end--;
        length--;
        return *m_end;
    }

    __device__ size_t getSize() {
        return length;
    }
};

- Robin Lew

1

Even if it is correct, it will most likely be slow because of the memory allocation in expand(). Still, it is a very worthwhile effort. - Subhodeep Maji

1

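You can't use std::vector in device-side code. Why?

It's not marked to allow this

The "formal" reason is that code used in a device-side function or kernel must itself be in a __device__ function, and the code in the standard library, std::vector included, is not. (There is an exception for constexpr code; in C++20, std::vector does have constexpr methods, but CUDA did not support C++20 at the time of writing, and that constexpr-ness is limited in practice anyway.)

You probably don't really want to

The std::vector class uses an allocator to obtain more memory when storage must grow, as you create a vector or add elements to it. By default (that is, if you use std::vector<T> for some T), the allocation is on the heap. While this could be adapted to the GPU, it would be quite slow, and extremely slow if every "CUDA thread" dynamically allocated its own memory.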
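
To make the per-thread allocation pattern concrete, here is a small sketch in which every thread calls the in-kernel malloc and free. It is meant only to show the shape of the pattern being warned about, not to measure it; perThreadScratch and runPerThreadScratch are hypothetical names, the 16 MB heap size is an arbitrary choice, and host-side error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <vector>

// Every thread allocates, fills, sums and frees its own scratch buffer.
__global__ void perThreadScratch(int elemsPerThread, int* sums)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread asks the device heap for its own buffer; this is the slow part.
    int* scratch = static_cast<int*>(malloc(elemsPerThread * sizeof(int)));
    if (scratch == nullptr) {  // in-kernel malloc returns NULL when the heap is exhausted
        sums[tid] = -1;
        return;
    }
    int sum = 0;
    for (int i = 0; i < elemsPerThread; ++i) {
        scratch[i] = i;
        sum += scratch[i];
    }
    sums[tid] = sum;
    free(scratch);
}

// Hypothetical host helper: launches one block and returns each thread's sum.
std::vector<int> runPerThreadScratch(int threads, int elemsPerThread)
{
    // Optionally resize the device heap (8 MB by default). This has to happen
    // before the first kernel that uses the device heap; the result is ignored here.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16u * 1024u * 1024u);

    std::vector<int> sums(threads, 0);
    int* dSums = nullptr;
    cudaMalloc(&dSums, threads * sizeof(int));

    perThreadScratch<<<1, threads>>>(elemsPerThread, dSums);

    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(sums.data(), dSums, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dSums);
    return sums;
}
```

With 4 threads and 5 elements per thread, each entry of the result should be 0+1+2+3+4 = 10. A preallocated buffer sliced by thread index would usually avoid the heap traffic entirely.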

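Now, you might say, "But I don't want to allocate memory, I just want to read from the vector!" Well, in that case, you don't need a vector as such. Just copy the data to a buffer on the device and pass a pointer and a size, or use a CUDA-capable span, such as the one in cuda-kat.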
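
A minimal sketch of the pointer-and-size approach follows. DeviceSpan is a hand-rolled, hypothetical stand-in for a span type, not the cuda-kat or libcu++ API, and doubleAll and doubleOnDevice are likewise made-up names. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical minimal span: a non-owning pointer + size pair usable in device code.
template <typename T>
struct DeviceSpan {
    T* ptr;
    size_t count;
    __host__ __device__ size_t size() const { return count; }
    __host__ __device__ T& operator[](size_t i) const { return ptr[i]; }
};

// The span is passed by value; it points at device memory the kernel does not own.
__global__ void doubleAll(DeviceSpan<int> values)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < values.size()) {
        values[i] *= 2;
    }
}

// Hypothetical host helper: copies a std::vector to the device, doubles it, copies it back.
std::vector<int> doubleOnDevice(const std::vector<int>& host)
{
    std::vector<int> result(host.size());
    if (host.empty()) {
        return result;
    }
    size_t bytes = host.size() * sizeof(int);
    int* dData = nullptr;
    cudaMalloc(&dData, bytes);
    cudaMemcpy(dData, host.data(), bytes, cudaMemcpyHostToDevice);

    unsigned int threads = 256;
    unsigned int blocks = static_cast<unsigned int>((host.size() + threads - 1) / threads);
    doubleAll<<<blocks, threads>>>(DeviceSpan<int>{dData, host.size()});

    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(result.data(), dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return result;
}
```

The std::vector stays on the host and owns the host copy; the kernel only ever sees the pointer and the element count, so nothing in device code needs to allocate.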

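Another option, albeit a somewhat "heavy" one, is the device_vector class from NVIDIA's Thrust library. Under the hood, it is quite different from the standard library vector.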

- einpoklum

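device_vector cannot be used in device code (but its iterators can). libcu++ now has official implementations of span and mdspan. - paleonix

@paleonix: There is such a project, but I object to libcu++. NVIDIA shouldn't try to replace the standard library, and there is no reason for people to try to use the standard library on CUDA devices. There are only some small pieces of it, and they shouldn't pretend to be a complete standard library. - einpoklum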

Content provided by Stack Overflow.

- Edvard Pedersen · Accepted Answer

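You cannot use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise, just copy the contents of the vector to the device and operate on it normally.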