OpenMP offloading on GPUs: the special case of 'simd'


I am trying to explain the following OpenMP constructs:

#pragma omp target teams distribute parallel for
for(int i = 0; i < N; ++i) {
    // compute
}

#pragma omp target teams distribute parallel for simd
for(int i = 0; i < N; ++i) {
    // compute
}

Note the `simd` construct added to the second loop. According to the OpenMP 5.1 specification, it declares that "multiple iterations of the loop can be executed concurrently by using SIMD instructions".
I believe I have a good understanding of how `simd` is implemented and used on CPUs, but on GPUs, and more specifically on AMD GPUs, there is no such thing as exposed SIMD instructions, because a HIP thread effectively is one lane of a SIMD instruction.
My understanding is that when OpenMP maps teams to thread blocks/workgroups and threads to SIMD lanes, then, if a loop-carried dependency exists, or the compiler cannot prove that none exists, it is forced to use thread blocks containing only one thread.
How would you interpret `target teams distribute parallel for simd`:
  • Does it mean that `simd` cannot be translated for the GPU in this context?
  • Or that each thread is treated as having a single SIMD lane?
There is at least one similar but old and unanswered question: How is omp simd for loop executed on GPUs?

Could it be that it is actually ignored? Did you try running the program with the simd switch? Is there any difference in the compilation result or in performance? - Fra93
@Fra93, I did try indeed: on loops like the one shown above (possibly with collapses), I saw no performance degradation or improvement. I used amdclang and the HPE Cray compiler. A few months ago, the Cray compiler had no viable support for the 'simd' construct combined with GPU offloading (it forced one thread per thread block!). - Etienne M
1 Answer

Based on the test cases below, the assembly generated for an AMD MI250 (gfx90a) is identical with and without `simd`. If you look at the CPU code, however, you will see a significant change: there, the `simd` construct enables optimizations similar to those you get from an explicit `restrict` qualifier.
In short, the `simd` construct is currently irrelevant for this GPU target, and even in very trivial cases it produces the following warning: "loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]".
#include <cstdint>

#define RESTRICT __restrict

using Float = double;

void test0_0(Float* a, const Float* b) {
    a[0] = b[0] * b[0];
    // Forced store/reload (b[0] could be a[0]).
    a[1] = b[0];
}

void test0_1(Float* RESTRICT a, const Float* RESTRICT b) {
    a[0] = b[0] * b[0];
    // No forced store/reload.
    a[1] = b[0];
}

void test1_0(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // Forced store/reload
        a[i + 1] = b[i + 0];
    }
}

void test1_1(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // simd -> no loop carried dependencies:
        // No forced store/reload -> easier vectorization, less generated code.
        a[i + 1] = b[i + 0];
    }
}

void test2_0(Float* a, Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, as expected.
        a[i + 1] = b[i + 0];
    }
}

void test2_1(Float* RESTRICT a, Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, but a/b are restricted BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_0(Float* a, const Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload even though simd is requested: BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_1(Float* RESTRICT a, const Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, but a/b are restricted BAD!
        a[i + 1] = b[i + 0];
    }
}

The code is available here: https://godbolt.org/z/sMY48s8jz

