What's the fastest way to move the high or low 64 bits of an integer SSE register?

Question

4

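What is the fastest way to move only the high or low 64 bits of an integer SSE register into another register? With SSE4.1 it can be done with a single instruction (_mm_blend_epi16). But what about older SSE versions? Shift and unpack? AND and OR? Does movsd have a bypass delay?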

Closely related question: Best way to shuffle the 64-bit portions of two __m128i's

- nwellnhof

3 Answers

5

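Agner Fog's Optimizing Assembly guide has a very good set of tables of data-movement instructions (section 13.3).

If you want to merge data from two registers into one, these are your options: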

MOVLHPS   # SSE. Low qword unchanged, high qword from low of source
MOVHLPS   # SSE. Low qword from high of source, high qword unchanged
MOVSD     # SSE2. Low qword from source (register only), high qword unchanged
# memory-source-only insns:
 MOVLPS/D  # SSE1/2.  Low qword from memory, high qword unchanged
 MOVHPS/D  # SSE1/2. High qword from memory, low qword unchanged
SHUFPD    # SSE2. Low qword from any position of destination. high qword from any position of source
PUNPCKLQDQ # SSE2. Low qword unchanged, high qword from low of source
PUNPCKHQDQ # SSE2. Low qword from high of destination, high qword from high of source
MOVQ       # SSE2. Low qword from source, high qword set to zero
PBLENDW    # SSE4.1
PINSRQ     # SSE4.1 (only takes the low64 of src)

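The descriptions above are copied and pasted from Agner Fog's tables; the copyright is his.

So shufpd looks like the best choice for inserting the high 64 bits from another register: it can keep the destination's low qword and take the high qword directly from the high half of the source. The other options need the data to be in the low 64 bits of the source (e.g. punpcklqdq or movlhps).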
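For readers working with intrinsics rather than assembly, here is a minimal sketch of the shufpd approach. The helper names (insert_high64_shufpd, low64, high64) are made up for illustration, and the casts between __m128i and __m128d are only there to satisfy the type system. A compiler is free to pick a different instruction with the same effect, so treat this as one way to express the operation, not a guarantee of the emitted code.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: keep the low 64 bits of dst, take the high 64 bits of src.
   Immediate 2 = binary 10: bit 0 = 0 selects dst's low qword for the low half,
   bit 1 = 1 selects src's high qword for the high half. */
static inline __m128i insert_high64_shufpd(__m128i dst, __m128i src)
{
    __m128d r = _mm_shuffle_pd(_mm_castsi128_pd(dst), _mm_castsi128_pd(src), 2);
    return _mm_castpd_si128(r);
}

/* Hypothetical helpers that read back the two qwords for checking. */
static inline uint64_t low64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[0];
}

static inline uint64_t high64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[1];
}

/* Small self-check: _mm_set_epi64x takes (high, low). */
static inline void check_insert_high64(void)
{
    __m128i dst = _mm_set_epi64x(0xD1, 0xD0);
    __m128i src = _mm_set_epi64x(0x51, 0x50);
    __m128i r = insert_high64_shufpd(dst, src);
    assert(low64(r) == 0xD0);   /* low half kept from dst */
    assert(high64(r) == 0x51);  /* high half taken from src */
}
```

Calling check_insert_high64() from a test or from main exercises the helper once.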

- Peter Cordes

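Regarding MOVSD, the Intel Intrinsics Guide says unaligned memory is fine. Both _mm_load_sd and _mm_store_sd state "mem_addr does not need to be aligned on any particular boundary." I guess the compiler does some extra work for users of the intrinsics. - jww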

1

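@jww: movsd with a memory source zero-extends into the XMM register (and yes, an unaligned address is fine, because the access is narrower than 16 bytes). movsd with a register source merges the low half into the destination. If you want to merge a low half from memory, use movlps; that's what it's for (and it only works with a memory source, not a register source). - Peter Cordes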
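The three behaviours described in that comment can be sketched with intrinsics as follows. The wrapper names are hypothetical, the example works on doubles to keep the pointer types honest (integer data would need casts), and the instruction named in each comment is the one the intrinsic is documented to correspond to, not necessarily what a given compiler emits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* MOVSD xmm, m64: load one double into the low half, zero the high half. */
static inline __m128d load_low_zero_high(const double *p)
{
    return _mm_load_sd(p);
}

/* MOVLPD xmm, m64 (MOVLPS moves the same bits): low half from memory,
   high half of dst preserved. */
static inline __m128d merge_low_from_mem(__m128d dst, const double *p)
{
    return _mm_loadl_pd(dst, p);
}

/* MOVSD xmm, xmm: low half from src, high half of dst preserved. */
static inline __m128d merge_low_from_reg(__m128d dst, __m128d src)
{
    return _mm_move_sd(dst, src);
}

/* Hypothetical helpers that read back each lane. */
static inline double low_lane(__m128d v)  { return _mm_cvtsd_f64(v); }
static inline double high_lane(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }

/* Small self-check: _mm_set_pd takes (high, low). */
static inline void check_movsd_forms(void)
{
    double mem = 3.0;
    __m128d dst = _mm_set_pd(2.0, 1.0);
    __m128d src = _mm_set_pd(20.0, 10.0);

    __m128d a = load_low_zero_high(&mem);
    assert(low_lane(a) == 3.0 && high_lane(a) == 0.0);

    __m128d b = merge_low_from_mem(dst, &mem);
    assert(low_lane(b) == 3.0 && high_lane(b) == 2.0);

    __m128d c = merge_low_from_reg(dst, src);
    assert(low_lane(c) == 10.0 && high_lane(c) == 2.0);
}
```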

3

I don't know about the fastest way, but perhaps the simplest is:

_mm_unpacklo_epi64(_mm_setzero_si128(), x)

[0, x0]

_mm_unpackhi_epi64(_mm_setzero_si128(), x)

[0, x1]

_mm_move_epi64(x)

[x0, 0]

_mm_unpackhi_epi64(x, _mm_setzero_si128())

[x1, 0]
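A quick way to confirm the lane order of those four results is to read each bracket as [low qword, high qword], which is the reading under which all four lines are consistent. The sketch below wraps each expression in a hypothetical helper and checks it; x0 and x1 are the low and high qwords of x.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical wrappers around the four expressions; results are [low, high]. */
static inline __m128i zero_then_x0(__m128i x) { return _mm_unpacklo_epi64(_mm_setzero_si128(), x); } /* [0, x0] */
static inline __m128i zero_then_x1(__m128i x) { return _mm_unpackhi_epi64(_mm_setzero_si128(), x); } /* [0, x1] */
static inline __m128i x0_then_zero(__m128i x) { return _mm_move_epi64(x); }                          /* [x0, 0] */
static inline __m128i x1_then_zero(__m128i x) { return _mm_unpackhi_epi64(x, _mm_setzero_si128()); } /* [x1, 0] */

/* Hypothetical helpers that read back the two qwords. */
static inline uint64_t low64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[0];
}

static inline uint64_t high64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[1];
}

/* Small self-check with x1 = 0x11 (high) and x0 = 0x10 (low). */
static inline void check_zeroing_moves(void)
{
    __m128i x = _mm_set_epi64x(0x11, 0x10);
    assert(low64(zero_then_x0(x)) == 0    && high64(zero_then_x0(x)) == 0x10);
    assert(low64(zero_then_x1(x)) == 0    && high64(zero_then_x1(x)) == 0x11);
    assert(low64(x0_then_zero(x)) == 0x10 && high64(x0_then_zero(x)) == 0);
    assert(low64(x1_then_zero(x)) == 0x11 && high64(x1_then_zero(x)) == 0);
}
```

Note that all four zero the other half of the result, so none of them preserves bits of a separate destination register.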

- Johnny Cage

1

I want to preserve the remaining bits of the destination register. Sorry for not making that clear. - nwellnhof

- Stephen Canon · Accepted Answer

To move the low 64 bits of src into dst, preserving the high 64 bits of dst:

movsd dst, src

To move the high 64 bits of src into dst, preserving the low 64 bits of dst:

shufps dst, src, E4h

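Bypass delays generally only add latency; they don't consume dispatch, execution, or retirement resources. So they only need to be considered when comparing sequences that are otherwise equal (i.e. if there is a single-instruction equivalent that stays in the integer domain, prefer it for integer work).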
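In intrinsics form, the two moves above might look like the sketch below. The function names replace_low64 and replace_high64 are invented for this example, and the casts exist only because the movsd and shufps intrinsics are typed for floating-point vectors. The compiler may substitute another instruction with the same result, so inspect the generated assembly if the exact instruction matters.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE intrinsics */

/* Hypothetical helper for "movsd dst, src":
   low 64 bits from src, high 64 bits of dst preserved. */
static inline __m128i replace_low64(__m128i dst, __m128i src)
{
    __m128d r = _mm_move_sd(_mm_castsi128_pd(dst), _mm_castsi128_pd(src));
    return _mm_castpd_si128(r);
}

/* Hypothetical helper for "shufps dst, src, E4h":
   0xE4 = 11 10 01 00, so dwords 0 and 1 come from dst (indices 0, 1)
   and dwords 2 and 3 come from src (indices 2, 3). */
static inline __m128i replace_high64(__m128i dst, __m128i src)
{
    __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(dst), _mm_castsi128_ps(src), 0xE4);
    return _mm_castps_si128(r);
}

/* Hypothetical helpers that read back the two qwords. */
static inline uint64_t low64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[0];
}

static inline uint64_t high64(__m128i v)
{
    uint64_t q[2];
    _mm_storeu_si128((__m128i *)q, v);
    return q[1];
}

/* Small self-check: _mm_set_epi64x takes (high, low). */
static inline void check_replace_halves(void)
{
    __m128i dst = _mm_set_epi64x(0x1111111122222222LL, 0x3333333344444444LL);
    __m128i src = _mm_set_epi64x(0x5555555566666666LL, 0x7777777708888888LL);

    __m128i lo = replace_low64(dst, src);
    assert(low64(lo) == 0x7777777708888888ULL);   /* from src */
    assert(high64(lo) == 0x1111111122222222ULL);  /* kept from dst */

    __m128i hi = replace_high64(dst, src);
    assert(low64(hi) == 0x3333333344444444ULL);   /* kept from dst */
    assert(high64(hi) == 0x5555555566666666ULL);  /* from src */
}
```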