OpenMP offloading on GPUs: the special case of 'simd'


I am trying to explain the following OpenMP constructs:

#pragma omp target teams distribute parallel for
for(int i = 0; i < N; ++i) {
    // compute
}

#pragma omp target teams distribute parallel for simd
for(int i = 0; i < N; ++i) {
    // compute
}

Note the `simd` construct added to the second loop. According to the OpenMP 5.1 specification, it declares that "multiple iterations of the loop can be executed concurrently by using SIMD instructions".
I believe I have a good understanding of how `simd` is implemented and used on CPUs, but on GPUs, and more specifically on AMD GPUs, there is no such thing as exposed SIMD instructions, because a HIP thread effectively is one lane of a SIMD instruction.
My understanding is that when OpenMP maps teams to thread blocks/workgroups and threads to SIMD lanes, then, if a loop-carried dependency exists, or the compiler cannot prove that none exists, it is forced to use thread blocks containing only one thread.
How would you interpret `target teams distribute parallel for simd`:
  • Does it mean that `simd` cannot be translated for the GPU in this context?
  • Or that each thread is treated as having a single SIMD lane?
There is at least one similar but old and unanswered question: How is omp simd for loop executed on GPUs?

Could it be that it is actually ignored? Did you try running the program with the simd switch? Is there any difference in the compilation result or in performance? - Fra93
@Fra93, I did try indeed: on loops like the one shown above (possibly with collapses), I saw no performance degradation or improvement. I used amdclang and the HPE Cray compiler. A few months ago, the Cray compiler had no viable support for the 'simd' construct combined with GPU offloading (it forced one thread per thread block!). - Etienne M
1 Answer

Based on the test cases below, the assembly generated for an AMD MI250 (gfx90a) is identical with and without `simd`. If you look at the CPU code, however, you will see a significant change: there, the `simd` construct enables optimizations similar to those you get from an explicit `restrict` qualifier.
In short, the `simd` construct is currently irrelevant for this GPU target, and even in very trivial cases it produces the following warning: "loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]".
#include <cstdint>

#define RESTRICT __restrict

using Float = double;

void test0_0(Float* a, const Float* b) {
    a[0] = b[0] * b[0];
    // Forced store/reload (b[0] could be a[0]).
    a[1] = b[0];
}

void test0_1(Float* RESTRICT a, const Float* RESTRICT b) {
    a[0] = b[0] * b[0];
    // No forced store/reload.
    a[1] = b[0];
}

void test1_0(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // Forced store/reload
        a[i + 1] = b[i + 0];
    }
}

void test1_1(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // simd -> no loop carried dependencies:
        // No forced store/reload -> easier vectorization, less generated code.
        a[i + 1] = b[i + 0];
    }
}

void test2_0(Float* a, Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, as expected.
        a[i + 1] = b[i + 0];
    }
}

void test2_1(Float* RESTRICT a, Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, but a/b are restricted BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_0(Float* a, const Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload even though simd is requested: BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_1(Float* RESTRICT a, const Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, but a/b are restricted BAD!
        a[i + 1] = b[i + 0];
    }
}

The code is available here: https://godbolt.org/z/sMY48s8jz

