How can I optimize this Delphi function with SSE2?

Question

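I need a hint on how to implement this Delphi function in SSE2 assembly (32-bit). Other optimizations are welcome too. Perhaps someone can tell me which instructions could be used, so that I have a starting point for further reading.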

const Precision = 10000;

// This function adds all Pixels into one. The pixels are weighted before adding. 
// A weight can range from 0 to "Precision". "Size" is typically 10 to 50.

function TFilter.Combine(Pixels: PByte; Weights: PCardinal; const Size: Cardinal): Cardinal;
var
  i, R, G, B, A: Cardinal;
begin
  B := Pixels^ * Weights^; Inc(Pixels);
  G := Pixels^ * Weights^; Inc(Pixels);
  R := Pixels^ * Weights^; Inc(Pixels);
  A := Pixels^ * Weights^; Inc(Pixels);
  Inc(Weights); // goto next weight
  for i := 1 to Size - 1 do
  begin
    Inc(B, Pixels^ * Weights^); Inc(Pixels);
    Inc(G, Pixels^ * Weights^); Inc(Pixels);
    Inc(R, Pixels^ * Weights^); Inc(Pixels);
    Inc(A, Pixels^ * Weights^); Inc(Pixels);
    Inc(Weights); // goto next weight
  end;
  B := B div Precision;
  G := G div Precision;
  R := R div Precision;
  A := A div Precision;

  Result := A shl 24 or R shl 16 or G shl 8 or B;
end;

Desired result:

function TFilter.Combine(Pixels: PByte; Weights: PCardinal; const Size: Cardinal): Cardinal;
asm
  // Insert fast SSE2-Code here ;-)
end;

- Steffen Binas

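Take a look at GR32 and see whether it already has the routine you need. If not, it contains plenty of optimized SSE2 code that can serve as a learning resource. - David Heffernan

How many pixels are combined at a time? I ask because if the number is small enough, you won't see any significant speedup because of all the overhead. Also, do the weights need to be 32-bit? Would 16 bits be enough to hold them? - Multimedia Mike

The weights don't have to be 32-bit, since they only range up to the precision of 10000 (which fits in 16 bits). - Steffen Binas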

1 Answer

- MBo · Accepted Answer

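The implementation is fairly straightforward. I have changed your function prototype to a regular function (instead of an object method).

This code runs roughly 3 times faster than the byte-by-byte function (1,000,000 iterations over a 256-element array take 1500 ms, which is about 0.7 GB/sec on my old Athlon XP 2.2 GHz).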

function Combine(Pixels: PByte; Weights: PInteger; const Size: Cardinal): Integer;
//x86, register calling convention - three parameters in EAX, EDX, ECX
const
  Precision: Single = 1.0; // use 10000.0 to match the question's Precision constant
asm
  pxor XMM6, XMM6 //zero const
  pxor XMM4, XMM4 // zero accum

@@cycle:
  movd XMM1, [eax] //load color data
  movss XMM3, [edx]  //load weight

  punpcklbw XMM1, XMM6 //bytes to words
  shufps XMM3, XMM3, 0 // 4 x weight
  punpcklwd XMM1, XMM6 //words to ints
  cvtdq2ps XMM2, XMM3  //ints to singles
  cvtdq2ps XMM0, XMM1  //ints to singles

  mulps XMM0, XMM2    //data * weight
  addps XMM4, XMM0    //accum  = accum + data * weight

  add eax, 4        // inc pointers
  add edx, 4
  loop @@cycle      // dec ecx, repeat until ecx = 0 (Size must be > 0)

  movss XMM5, Precision
  shufps XMM5, XMM5, 0 // 4 x precision constant

  divps XMM4, XMM5    //accum/precision

  cvtps2dq XMM2, XMM4  //singles to ints, rounded per MXCSR (nearest by default)
  packssdw XMM2, XMM2 //ints to SmallInts (signed 16-bit, saturated)
  packuswb XMM2, XMM2  //SmallInts to bytes (unsigned, saturated to 0..255)

  movd eax, XMM2  //result
end;
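
As a cross-check, the lane-by-lane arithmetic of the SSE2 routine above can be modelled in plain Pascal. This is only a sketch: `CombineModel`, its extra `Precision` parameter and the sample arrays are hypothetical names that are not part of the answer, and it assumes the default round-to-nearest mode for both `Round` and `cvtps2dq`. It also makes visible that the SSE2 version rounds the quotient, while the `div` in the question truncates it.

```pascal
program CombineModelDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Hypothetical sample data: two BGRA pixels and one weight per pixel.
  SamplePixels: array[0..7] of Byte = (10, 20, 30, 40, 50, 60, 70, 80);
  SampleWeights: array[0..1] of Integer = (2500, 7500);

// Scalar model of the SSE2 loop: four Single accumulators, one per channel.
function CombineModel(Pixels: PByte; Weights: PInteger; Size: Cardinal;
  Precision: Single): Cardinal;
var
  Acc: array[0..3] of Single;
  Lane, V: Integer;
  i: Cardinal;
begin
  for Lane := 0 to 3 do
    Acc[Lane] := 0;                       // pxor XMM4, XMM4

  for i := 1 to Size do
  begin
    for Lane := 0 to 3 do
    begin
      // mulps + addps: byte * weight, accumulated as Single
      Acc[Lane] := Acc[Lane] + Pixels^ * Weights^;
      Inc(Pixels);
    end;
    Inc(Weights);                         // one weight per 4-byte pixel
  end;

  Result := 0;
  for Lane := 0 to 3 do
  begin
    V := Round(Acc[Lane] / Precision);    // divps + cvtps2dq (rounds, does not truncate)
    if V < 0 then V := 0;                 // packssdw/packuswb saturate to 0..255
    if V > 255 then V := 255;
    Result := Result or (Cardinal(V) shl (8 * Lane));
  end;
end;

begin
  // Pass 10000.0 to mirror the Precision constant from the question.
  WriteLn(IntToHex(CombineModel(@SamplePixels[0], @SampleWeights[0], 2, 10000.0), 8));
end.
```

With the sample data each channel is (p0 * 2500 + p1 * 7500) / 10000, that is 40, 50, 60 and 70, so the program should print 463C3228. Comparing this model against the assembly routine on random inputs is one way to check a port of the code.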