I have considered the following test program.
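// Swap through references: the caller's variables are exchanged.
template <typename T> __device__ inline void swap_test_device1(T& a, T& b)
{
    T c(a); a=b; b=c;
}
// Same body, but the arguments are passed by value.
template <typename T> __device__ inline void swap_test_device2(T a, T b)
{
    T c(a); a=b; b=c;
}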
__global__ void swap_test_global(const int* __restrict__ input1, const int* __restrict__ input2, int* output1, int* output2) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int x = input1[tx]*input1[tx];
    int y = input2[tx]*input2[tx];
    swap_test_device1(x,y);
    output1[tx] = x;
    output2[tx] = y;
}
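As a side check on the two helpers, here is a minimal host-compilable sketch of their language-level semantics. The names `swap_by_reference`, `swap_by_value`, `check_swap_helpers` and the `SWAP_HD` macro are assumptions introduced for illustration; only the function bodies come from the program above. In standard C++ a by-value helper can exchange only its own local copies, which is worth keeping in mind when comparing the two variants.

```cpp
#include <cassert>

// SWAP_HD is a hypothetical macro: it expands to __host__ __device__ under
// nvcc and to nothing under a plain C++ compiler, so the sketch builds with both.
#ifdef __CUDACC__
#define SWAP_HD __host__ __device__
#else
#define SWAP_HD
#endif

// Same body as swap_test_device1: arguments are taken by reference.
template <typename T> SWAP_HD inline void swap_by_reference(T& a, T& b)
{
    T c(a); a = b; b = c;
}

// Same body as swap_test_device2: arguments are taken by value.
template <typename T> SWAP_HD inline void swap_by_value(T a, T b)
{
    T c(a); a = b; b = c;   // only the local copies are exchanged
}

// Hypothetical host-side check of both helpers.
inline void check_swap_helpers()
{
    int x = 4, y = 9;
    swap_by_reference(x, y);
    assert(x == 9 && y == 4);   // the caller's variables are exchanged

    int p = 4, q = 9;
    swap_by_value(p, q);
    assert(p == 4 && q == 9);   // the caller's variables are unchanged
}
```

Calling `check_swap_helpers()` from a small host program exercises both cases; it says nothing about the generated SASS, only about what the source code asks for.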
I have disassembled it. The result is the same when using swap_test_device1 and swap_test_device2. The common disassembled code is the following:
MOV R1, c[0x1][0x100]
S2R R0, SR_CTAID.X
S2R R2, SR_TID.X
MOV32I R9, 0x4
IMAD R3, R0, c[0x0][0x8], R2
IMAD R6.CC, R3, R9, c[0x0][0x28]
IMAD.HI.X R7, R3, R9, c[0x0][0x2c]
IMAD R10.CC, R3, R9, c[0x0][0x20]
LD.E R2, [R6]
IMAD.HI.X R11, R3, R9, c[0x0][0x24]
IMAD R4.CC, R3, R9, c[0x0][0x30]
LD.E R0, [R10]
IMAD.HI.X R5, R3, R9, c[0x0][0x34]
IMAD R8.CC, R3, R9, c[0x0][0x38]
IMAD.HI.X R9, R3, R9, c[0x0][0x3c]
IMUL R2, R2, R2
ST.E [R4], R2
IMUL R0, R0, R0
ST.E [R8], R0
EXIT
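There seems to be no explicit swap in the disassembled code. In other words, for this simple example the compiler is able to optimize the swap away and write x and y directly to the right global memory locations.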
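One way to read the listing, assuming the four kernel parameters sit at c[0x0][0x20] through c[0x0][0x3c] in declaration order, is that the value loaded from input2 is squared and stored through output1, and the value loaded from input1 is squared and stored through output2. The sketch below emulates a single thread on the host in both forms and checks that they agree. The function names are hypothetical and this is not the compiler's actual transformation, only an illustration of why no register swap is needed.

```cpp
#include <cassert>
#include <vector>

// One "thread" of swap_test_global, with the body of swap_test_device1
// inlined by hand (hypothetical host-side helper).
inline void swap_test_thread(const int* input1, const int* input2,
                             int* output1, int* output2, int tx)
{
    int x = input1[tx] * input1[tx];
    int y = input2[tx] * input2[tx];
    int c(x); x = y; y = c;          // explicit swap
    output1[tx] = x;
    output2[tx] = y;
}

// The same work with no swap at all: the stores are simply crossed.
inline void crossed_store_thread(const int* input1, const int* input2,
                                 int* output1, int* output2, int tx)
{
    output1[tx] = input2[tx] * input2[tx];
    output2[tx] = input1[tx] * input1[tx];
}

// Runs both forms over every index and compares the outputs.
inline bool crossed_store_matches(const std::vector<int>& in1,
                                  const std::vector<int>& in2)
{
    assert(in1.size() == in2.size());   // both inputs must have the same length
    const int n = static_cast<int>(in1.size());
    std::vector<int> a1(n), a2(n), b1(n), b2(n);
    for (int tx = 0; tx < n; ++tx) {
        swap_test_thread(in1.data(), in2.data(), a1.data(), a2.data(), tx);
        crossed_store_thread(in1.data(), in2.data(), b1.data(), b2.data(), tx);
    }
    return a1 == b1 && a2 == b2;
}
```

Since x and y live only in registers, renaming which register feeds which store is free, so the swap costs nothing once the helper is inlined.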
EDIT
I have now considered the following, more involved test case
__global__ void swap_test_global(const char* __restrict__ input1, const char* __restrict__ input2, char* output1, char* output2) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    char x = input1[tx];
    char y = input2[tx];
    swap_test_device1(x,y);
    output1[tx] = (x >> 3) & y;
    output2[tx] = (y >> 5) & x;
}
using the same __device__ functions as above. The disassembled code is:
MOV R1, c[0x1][0x100]
S2R R0, SR_CTAID.X
S2R R2, SR_TID.X
IMAD R0, R0, c[0x0][0x8], R2
BFE R7, R0, 0x11f
IADD R8.CC, R0, c[0x0][0x28]
IADD.X R9, R7, c[0x0][0x2c]
IADD R10.CC, R0, c[0x0][0x20]
LD.E.S8 R4, [R8]
IADD.X R11, R7, c[0x0][0x24]
IADD R2.CC, R0, c[0x0][0x30]
LD.E.S8 R5, [R10]
IADD.X R3, R7, c[0x0][0x34]
IADD R12.CC, R0, c[0x0][0x38]
IADD.X R13, R7, c[0x0][0x3c]
SHR.U32 R0, R4, 0x3
SHR.U32 R6, R5, 0x5
LOP.AND R5, R0, R5
LOP.AND R0, R6, R4
ST.E.U8 [R2], R5
ST.E.U8 [R12], R0
EXIT
As you can see, there is still no apparent register swap.
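The same reading applies here: the shifts and ANDs can take their operands from the "other" loaded register instead of exchanging registers first. The following host-side sketch compares the swapped form with a crossed-operand form over all byte values. The names `BytePair`, `with_explicit_swap`, `with_crossed_operands` and `forms_agree_for_all_bytes` are hypothetical, and `signed char` is used as an assumption to match the signed 8-bit loads (LD.E.S8) in the listing, since the signedness of plain `char` depends on the platform.

```cpp
#include <cassert>   // used by the checks in the usage note

struct BytePair { signed char out1; signed char out2; };

// Per-thread body of the second kernel, with the swap written out.
inline BytePair with_explicit_swap(signed char in1, signed char in2)
{
    signed char x = in1;
    signed char y = in2;
    signed char c(x); x = y; y = c;  // body of swap_test_device1
    BytePair r;
    r.out1 = static_cast<signed char>((x >> 3) & y);
    r.out2 = static_cast<signed char>((y >> 5) & x);
    return r;
}

// No swap: each expression reads the other input directly.
inline BytePair with_crossed_operands(signed char in1, signed char in2)
{
    BytePair r;
    r.out1 = static_cast<signed char>((in2 >> 3) & in1);
    r.out2 = static_cast<signed char>((in1 >> 5) & in2);
    return r;
}

// Exhaustive comparison over every pair of 8-bit inputs.
inline bool forms_agree_for_all_bytes()
{
    for (int a = -128; a <= 127; ++a) {
        for (int b = -128; b <= 127; ++b) {
            const BytePair s = with_explicit_swap(static_cast<signed char>(a),
                                                  static_cast<signed char>(b));
            const BytePair t = with_crossed_operands(static_cast<signed char>(a),
                                                     static_cast<signed char>(b));
            if (s.out1 != t.out1 || s.out2 != t.out2) return false;
        }
    }
    return true;
}
```

For instance, `assert(forms_agree_for_all_bytes());` passes, and `with_explicit_swap(90, 127)` yields `out1 == 10` and `out2 == 2`.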