Initialization performance of System.Numerics.Vector<T> on .NET Framework

Question

Tags: c#, .net, .net-core, simd, system.numerics

System.Numerics.Vector brings SIMD support to .NET Core and .NET Framework. It works on .NET Framework 4.6+ and on .NET Core.

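```csharp
// Assumed benchmark members: float[] left, right, results (all the same length)
// and floatSlots == Vector<float>.Count. Requires: using System.Numerics;

// Baseline
public void SimpleSumArray()
{
    for (int i = 0; i < left.Length; i++)
        results[i] = left[i] + right[i];
}

// Using Vector<T> for SIMD support
public void SimpleSumVectors()
{
    // Largest multiple of the vector width that fits in the array
    int ceiling = left.Length / floatSlots * floatSlots;

    for (int i = 0; i < ceiling; i += floatSlots)
    {
        // Each constructor copies floatSlots elements into a new vector
        Vector<float> v1 = new Vector<float>(left, i);
        Vector<float> v2 = new Vector<float>(right, i);
        (v1 + v2).CopyTo(results, i);
    }
    // Scalar loop for the elements that do not fill a whole vector
    for (int i = ceiling; i < left.Length; i++)
    {
        results[i] = left[i] + right[i];
    }
}
```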

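Unfortunately, initializing the vectors can be the limiting step. To work around this, several sources suggest using MemoryMarshal to reinterpret the source array as an array of vectors [1][2]. For example: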

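```csharp
// Improving Vector<T> Initialization Performance
// Requires: using System; using System.Numerics; using System.Runtime.InteropServices;
public void SimpleSumVectorsNoCopy()
{
    int numVectors = left.Length / floatSlots;
    int ceiling = numVectors * floatSlots;
    // leftMemory is simply a ReadOnlyMemory<float> referring to the "left" array;
    // rightMemory likewise, and resultsMemory is a Memory<float> over "results"
    ReadOnlySpan<Vector<float>> leftVecArray = MemoryMarshal.Cast<float, Vector<float>>(leftMemory.Span);
    ReadOnlySpan<Vector<float>> rightVecArray = MemoryMarshal.Cast<float, Vector<float>>(rightMemory.Span);
    Span<Vector<float>> resultsVecArray = MemoryMarshal.Cast<float, Vector<float>>(resultsMemory.Span);
    for (int i = 0; i < numVectors; i++)
        resultsVecArray[i] = leftVecArray[i] + rightVecArray[i];
    // The cast spans cover only whole vectors; handle the remaining elements
    for (int i = ceiling; i < left.Length; i++)
        results[i] = left[i] + right[i];
}
```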
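MemoryMarshal.Cast does not copy anything. It reinterprets the float span as a span of whole Vector<float> values. Any elements past the last full vector therefore do not appear in the cast span at all, which is why a scalar tail loop is still needed.

The helper below makes this visible. It is a small sketch, and `CastDemo.Describe` is a hypothetical name, not part of any library. It assumes Span and MemoryMarshal are available: they are in-box on .NET Core, and on .NET Framework they come from the System.Memory NuGet package.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class CastDemo
{
    // Returns how many whole Vector<float> values MemoryMarshal.Cast exposes
    // over the array, and how many trailing floats it leaves uncovered.
    public static (int vectors, int leftover) Describe(float[] data)
    {
        ReadOnlySpan<Vector<float>> vecs =
            MemoryMarshal.Cast<float, Vector<float>>(data.AsSpan());
        int covered = vecs.Length * Vector<float>.Count;
        return (vecs.Length, data.Length - covered);
    }
}
```

Take an array of `Vector<float>.Count * 3 + 2` floats. `Describe` returns `(3, 2)`: three vectors plus two leftover floats that only a scalar loop will touch.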

This gives a significant speedup when running on .NET Core:

|                 Method |      Mean |     Error |    StdDev |
|----------------------- |----------:|----------:|----------:|
|         SimpleSumArray | 165.90 us | 0.1393 us | 0.1303 us |
|       SimpleSumVectors |  53.69 us | 0.0473 us | 0.0443 us |
| SimpleSumVectorsNoCopy |  31.65 us | 0.1242 us | 0.1162 us |

Unfortunately, on .NET Framework this way of initializing vectors has the opposite effect and actually makes performance worse:

|                 Method |      Mean |    Error |   StdDev |
|----------------------- |----------:|---------:|---------:|
|         SimpleSumArray | 152.92 us | 0.128 us | 0.114 us |
|       SimpleSumVectors |  52.35 us | 0.041 us | 0.038 us |
| SimpleSumVectorsNoCopy |  77.50 us | 0.089 us | 0.084 us |

Is there a way to optimize Vector initialization on .NET Framework and get performance similar to .NET Core? The measurements were made with this sample application [1].

[1] https://github.com/CBGonzalez/SIMDPerformance
[2] https://stackoverflow.com/a/62702334/430935

- LTR

According to https://github.com/dotnet/corefxlab/issues/2581, "the .NET Framework runtime/JIT does not know about span", so you shouldn't expect MemoryMarshal to give better results.

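Thanks, that explains the results. So span code falls back to a generic, portable implementation that is slow. It cannot do what the same code does on .NET Core, namely reinterpret the array as an array of vectors without copying. Is there any workaround, perhaps a different trick, that can improve performance on .NET Framework?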

1 Answer

- harold · Accepted Answer

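As far as I know, the only efficient way to load vectors on .NET Framework 4.6 or 4.7 (this is expected to change in 5.0) is to use unsafe code. For example, use Unsafe.Read<Vector<float>>, or its unaligned variant if applicable: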

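```csharp
// Requires: using System.Numerics; using System.Runtime.CompilerServices;
// and unsafe code enabled (AllowUnsafeBlocks)
public unsafe void SimpleSumVectors()
{
    int ceiling = left.Length / floatSlots * floatSlots;

    // Pin the arrays so raw pointers into them stay valid
    fixed (float* leftp = left, rightp = right, resultsp = results)
    {
        for (int i = 0; i < ceiling; i += floatSlots)
        {
            // Load a Vector<float> directly from memory, add, and store it back
            Unsafe.Write(resultsp + i,
                Unsafe.Read<Vector<float>>(leftp + i) + Unsafe.Read<Vector<float>>(rightp + i));
        }
    }
    // Scalar loop for the elements that do not fill a whole vector
    for (int i = ceiling; i < left.Length; i++)
    {
        results[i] = left[i] + right[i];
    }
}
```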
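The answer mentions an unaligned variant. The sketch below is the same loop written with `Unsafe.ReadUnaligned` and `Unsafe.WriteUnaligned`, which make no assumption about how the addresses inside the pinned arrays are aligned.

It assumes the same System.Runtime.CompilerServices.Unsafe package the answer uses, with unsafe code enabled. `UnalignedSum` is a hypothetical name. It has not been benchmarked here, so whether it is faster or slower than `Unsafe.Read` on .NET Framework is an open question.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class UnalignedSum
{
    // Adds left and right element-wise into results (all arrays the same length).
    public static unsafe void Sum(float[] left, float[] right, float[] results)
    {
        int width = Vector<float>.Count;
        int ceiling = left.Length / width * width;

        fixed (float* lp = left, rp = right, dp = results)
        {
            for (int i = 0; i < ceiling; i += width)
            {
                // No alignment requirement on lp + i, rp + i or dp + i
                Vector<float> sum = Unsafe.ReadUnaligned<Vector<float>>(lp + i)
                                  + Unsafe.ReadUnaligned<Vector<float>>(rp + i);
                Unsafe.WriteUnaligned(dp + i, sum);
            }
        }

        // Scalar loop for the elements that do not fill a whole vector
        for (int i = ceiling; i < left.Length; i++)
            results[i] = left[i] + right[i];
    }
}
```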

This uses the System.Runtime.CompilerServices.Unsafe package, which you can get from NuGet, but it can also be done without it.
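One possible way to avoid the package is sketched below, assuming the project can use C# 8.0 or later. Since C# 8.0, `Vector<float>` counts as an unmanaged constructed type, so a `float*` can be cast to `Vector<float>*` and dereferenced directly. Under C# 7.3, the default language version for .NET Framework projects, this does not compile, so `LangVersion` would have to be raised.

`PointerSum` is a hypothetical name, and unsafe code must be enabled. This is not necessarily what the answer's author had in mind, and its speed on .NET Framework has not been measured here.

```csharp
using System.Numerics;

public static class PointerSum
{
    // Needs C# 8.0+ (pointer to the constructed type Vector<float>) and unsafe code.
    public static unsafe void Sum(float[] left, float[] right, float[] results)
    {
        int width = Vector<float>.Count;
        int ceiling = left.Length / width * width;

        fixed (float* lp = left, rp = right, dp = results)
        {
            for (int i = 0; i < ceiling; i += width)
            {
                // Reinterpret each float* as a Vector<float>* and dereference it
                *(Vector<float>*)(dp + i) =
                    *(Vector<float>*)(lp + i) + *(Vector<float>*)(rp + i);
            }
        }

        // Scalar loop for the elements that do not fill a whole vector
        for (int i = ceiling; i < left.Length; i++)
            results[i] = left[i] + right[i];
    }
}
```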