问题介绍
我正在尝试加速我正在编写的(2d)光线追踪器的交点代码。我正在使用C#和System.Numerics库来提高SIMD指令的速度。
问题是,我得到了奇怪的结果,有些速度飙升,而有些则相对较低。我的问题是,为什么一个速度飙升,而另一个速度相对较低?
背景:
- RayPack结构体是一系列(不同的)光线,打包在System.Numerics的向量中。
- BoundingBoxPack和CirclePack结构体是一个单独的bb / circle,打包在System.Numerics的向量中。
- 所使用的CPU是i7-4710HQ(Haswell),带有SSE 4.2,AVX(2)和FMA(3)指令。
- 以发布模式(64位)运行。该项目在.Net Framework 472中运行。未设置其他选项。
尝试
我已经尝试查找某些操作是否被正确支持或不支持(请注意:这些是针对c ++的。https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/或http://sci.tuomastonteri.fi/programming/sse),但似乎不是这种情况,因为我使用的笔记本电脑支持SSE 4.2。
在当前代码中,应用了以下更改:
- 使用更适当的指令(例如打包min)。
- 不使用float * vector指令(会导致许多额外的操作,请参见原始汇编)。
代码...片段?
很抱歉有大量的代码,但我不确定如果没有这么多代码我们如何具体讨论这个问题。
Ray -> BoundingBox的代码
public bool CollidesWith(Ray ray, out float t)
{
// https://gamedev.stackexchange.com/questions/18436/most-efficient-aabb-vs-ray-collision-algorithms
// r.dir is unit direction vector of ray
float dirfracx = 1.0f / ray.direction.X;
float dirfracy = 1.0f / ray.direction.Y;
// lb is the corner of AABB with minimal coordinates - left bottom, rt is maximal corner
// r.org is origin of ray
float t1 = (this.rx.min - ray.origin.X) * dirfracx;
float t2 = (this.rx.max - ray.origin.X) * dirfracx;
float t3 = (this.ry.min - ray.origin.Y) * dirfracy;
float t4 = (this.ry.max - ray.origin.Y) * dirfracy;
float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));
// if tmax < 0, ray (line) is intersecting AABB, but the whole AABB is behind us
if (tmax < 0)
{
t = tmax;
return false;
}
// if tmin > tmax, ray doesn't intersect AABB
if (tmin > tmax)
{
t = tmax;
return false;
}
t = tmin;
return true;
}
RayPack代码 -> BoundingBox打包
public Vector<int> CollidesWith(ref RayPack ray, out Vector<float> t)
{
// ------------------------------------------------------- \\
// compute the collision. \\
Vector<float> dirfracx = Constants.ones / ray.direction.x;
Vector<float> dirfracy = Constants.ones / ray.direction.y;
Vector<float> t1 = (this.rx.min - ray.origin.x) * dirfracx;
Vector<float> t2 = (this.rx.max - ray.origin.x) * dirfracx;
Vector<float> t3 = (this.ry.min - ray.origin.y) * dirfracy;
Vector<float> t4 = (this.ry.max - ray.origin.y) * dirfracy;
Vector<float> tmin = Vector.Max(Vector.Min(t1, t2), Vector.Min(t3, t4));
Vector<float> tmax = Vector.Min(Vector.Max(t1, t2), Vector.Max(t3, t4));
Vector<int> lessThanZeroMask = Vector.GreaterThan(tmax, Constants.zeros);
Vector<int> greaterMask = Vector.LessThan(tmin, tmax);
Vector<int> combinedMask = Vector.BitwiseOr(lessThanZeroMask, greaterMask);
// ------------------------------------------------------- \\
// Keep track of the t's that collided. \\
t = Vector.ConditionalSelect(combinedMask, tmin, Constants.floatMax);
return combinedMask;
}
Ray的代码 -> 圆形
public bool Intersect(Circle other)
{
// Step 0: Work everything out on paper!
// Step 1: Gather all the relevant data.
float ox = this.origin.X;
float dx = this.direction.X;
float oy = this.origin.Y;
float dy = this.direction.Y;
float x0 = other.origin.X;
float y0 = other.origin.Y;
float cr = other.radius;
// Step 2: compute the substitutions.
float p = ox - x0;
float q = oy - y0;
float r = 2 * p * dx;
float s = 2 * q * dy;
// Step 3: compute the substitutions, check if there is a collision.
float a = dx * dx + dy * dy;
float b = r + s;
float c = p * p + q * q - cr * cr;
float DSqrt = b * b - 4 * a * c;
// no collision possible! Commented out to make the benchmark more fair
//if (DSqrt < 0)
//{ return false; }
// Step 4: compute the substitutions.
float D = (float)Math.Sqrt(DSqrt);
float t0 = (-b + D) / (2 * a);
float t1 = (-b - D) / (2 * a);
float ti = Math.Min(t0, t1);
if(ti > 0 && ti < t)
{
t = ti;
return true;
}
return false;
}
RayPack的代码 -> CirclePack 可以在此处找到原始未编辑的代码:https://pastebin.com/87nYgZrv
public Vector<int> Intersect(CirclePack other)
{
// ------------------------------------------------------- \\
// Put all the data on the stack. \\
Vector<float> zeros = Constants.zeros;
Vector<float> twos = Constants.twos;
Vector<float> fours = Constants.fours;
// ------------------------------------------------------- \\
// Compute whether the ray collides with the circle. This \\
// is stored in the 'mask' vector. \\
Vector<float> p = this.origin.x - other.origin.x; ;
Vector<float> q = this.origin.y - other.origin.y;
Vector<float> r = twos * p * this.direction.x;
Vector<float> s = twos * q * this.direction.y; ;
Vector<float> a = this.direction.x * this.direction.x + this.direction.y * this.direction.y;
Vector<float> b = r + s;
Vector<float> c = p * p + q * q - other.radius * other.radius;
Vector<float> DSqrt = b * b - fours * a * c;
Vector<int> maskCollision = Vector.GreaterThan(DSqrt, zeros);
// Commented out to make the benchmark more fair.
//if (Vector.Dot(maskCollision, maskCollision) == 0)
//{ return maskCollision; }
// ------------------------------------------------------- \\
// Update t if and only if there is a collision. Take \\
// note of the conditional where we compute t. \\
Vector<float> D = Vector.SquareRoot(DSqrt);
Vector<float> bMinus = Vector.Negate(b);
Vector<float> twoA = twos * a;
Vector<float> t0 = (bMinus + D) / twoA;
Vector<float> t1 = (bMinus - D) / twoA;
Vector<float> tm = Vector.ConditionalSelect(Vector.LessThan(t1, t0), t1, t0);
Vector<int> maskBiggerThanZero = Vector.GreaterThan(tm, zeros);
Vector<int> maskSmallerThanT = Vector.LessThan(tm, this.t);
Vector<int> mask = Vector.BitwiseAnd(
maskCollision,
Vector.BitwiseAnd(
maskBiggerThanZero,
maskSmallerThanT)
);
this.t = Vector.ConditionalSelect(
mask, // the bit mask that allows us to choose.
tm, // the smallest of the t's.
t); // if the bit mask is false (0), then we get our original t.
return mask;
}
汇编代码
这些可以在pastebin上找到。请注意,基准测试工具中有一些样板汇编。您需要查看函数调用。
- BoundingBox(Pack):https://pastebin.com/RYMQdZMh
- Circle(Pack)优化版:https://pastebin.com/YZHjc1vY
- Circle(Pack)原始版:https://pastebin.com/87nYgZrv
基准测试
我一直在使用BenchmarkDotNet对情况进行基准测试。
Circle / CirclePack(更新后)的结果:
方法 | 平均值 | 误差 | 标准偏差 |
---|---|---|---|
交叉点 | 9.710 毫秒 | 0.0540 毫秒 | 0.0505 毫秒 |
IntersectionPacked | 3.296 毫秒 | 0.0055 毫秒 | 0.0051 毫秒 |
BoundingBox / BoundingBoxPacked的结果:
方法 | 平均值 | 误差 | 标准偏差 |
---|---|---|---|
交叉点 | 24.269 毫秒 | 0.2663 毫秒 | 0.2491 毫秒 |
IntersectionPacked | 1.152 毫秒 | 0.0229 毫秒 | 0.0264 毫秒 |
由于AVX,预期可以获得大约6倍到8倍的加速。边框的加速非常显著,而圆形的加速则相对较低。
重新审视顶部的问题:为什么一个速度提升非常高,而另一个速度提升却相对较低?另外,如何让速度提升较低的那个(CirclePack)变得更快?针对Peter Cordes的评论所做的修改:
- 使基准测试更加公平:单线程版本在光线不能再相撞时不会立即分支出去。现在速度提升大约是2.5倍。 - 将汇编代码作为单独的头文件添加。 - 关于平方根:它确实有影响,但并不像看起来的那么大。去掉向量平方根会将总时间缩短约0.3毫秒。单线程代码现在总是执行平方根。 - 关于C#中的FMA(Fused Multiply Addition)指令:我认为它对标量有效(请参见Can C# make use of fused multiply-add?),但我没有在System.Numerics.Vector结构中找到类似的操作。 - 关于打包最小值的C#指令:是的,它确实存在。我太傻了。我甚至已经使用过它了。
Vector.ConditionalSelect( Vector.LessThan(...), ...)
Does C# not have an intrinsic for packed-min? SSE/AVX hasminps
in hardware, but I'd worry that this source would trick the compiler into doing something horrible likecmpps
->blendvps
- Peter Cordes