Performance of struct tuples

Question

c# performance f# tuples algebraic-data-types

21

The following F# program defines a function that returns the smaller of two pairs of ints represented as struct tuples, and it takes 1.4 s to run:

let [<EntryPoint>] main _ =
  let min a b : int = if a < b then a else b
  let min (struct(a1, b1)) (struct(a2, b2)) = struct(min a1 a2, min b1 b2)
  let mutable x = struct(0, 0)
  for i in 1..100000000 do
    x <- min x (struct(i, i))
  0

If I decompile the CIL to C#, I get this code:

    public static int MinInt(int a, int b)
    {
        if (a < b)
        {
            return a;
        }
        return b;
    }

    public static System.ValueTuple<int, int> MinPair(System.ValueTuple<int, int> _arg2, System.ValueTuple<int, int> _arg1)
    {
        int b = _arg2.Item2;
        int a = _arg2.Item1;
        int b2 = _arg1.Item2;
        int a2 = _arg1.Item1;
        return new System.ValueTuple<int, int>(MinInt(a, a2), MinInt(b, b2));
    }

    public static void Main(string[] args)
    {
        System.ValueTuple<int, int> x = new System.ValueTuple<int, int>(0, 0);
        for (int i = 1; i <= 100000000; i++)
        {
            x = MinPair(x, new System.ValueTuple<int, int>(i, i));
        }
    }

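Recompiled with the C# compiler, it takes just 0.3 s, which is over 4x faster than the original F#.

I cannot see why one program is so much faster than the other. I have even decompiled both versions to CIL and cannot find any obvious reason. Calling the C# Min function from F# gives the same (poor) performance. The CIL of the caller's inner loop is identical in both cases.

Can anyone explain this substantial performance difference?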

- J D

4

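Not sure why, but it runs in 0.3 s when compiled for x86 and in 1.4 s when compiled for x64. - Antonín Lejsek

Runtime overhead? - aybe

A single run is not enough to draw any conclusions. Use BenchmarkDotNet to collect enough meaningful data to compare, and post those statistics. - Panagiotis Kanavos

I have accounted for startup and run multiple iterations, of course. - J D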

1 Answer

- Just another metaprogrammer · Accepted Answer

Are you running both examples on the same architecture? I get ~1.4 s on x64 for both the F# and the C# code, and on x86 I get ~0.6 s for F# and ~0.3 s for C#.
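To check which architecture a given run actually uses, and to repeat the measurement a few times, something like the following could be used. It is only a rough sketch: the names `minInt`, `minPair`, `runLoop` and `timeRuns` are made up for illustration (the question nests both `min` functions inside `main`), and a plain `Stopwatch` loop is a stand-in for a proper benchmarking tool such as BenchmarkDotNet.

```fsharp
open System
open System.Diagnostics

// Same pairwise minimum as in the question, lifted to top level (hypothetical names).
let minInt (a: int) (b: int) : int = if a < b then a else b
let minPair (struct (a1, b1)) (struct (a2, b2)) = struct (minInt a1 a2, minInt b1 b2)

// Runs the loop from the question for n iterations and returns the final pair.
let runLoop (n: int) =
    let mutable x = struct (0, 0)
    for i in 1 .. n do
        x <- minPair x (struct (i, i))
    x

// Times `runs` repetitions of the loop and returns the elapsed milliseconds of each.
let timeRuns (runs: int) (n: int) =
    [ for _ in 1 .. runs do
        let sw = Stopwatch.StartNew()
        runLoop n |> ignore
        sw.Stop()
        yield sw.ElapsedMilliseconds ]

// Report whether the process is 64-bit, then print one timing per run.
printfn "64-bit process: %b" Environment.Is64BitProcess
for ms in timeRuns 3 100000000 do
    printfn "%d ms" ms
```

The first run includes JIT compilation, so the later runs are the ones worth comparing between x86 and x64 builds.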

As you say, the code looks very similar when decompiling the assemblies, but some differences appear when examining the IL code:

F# - let min (struct(a1, b1)) (struct(a2, b2)) ...

.maxstack 5
.locals init (
  [0] int32 b1,
  [1] int32 a1,
  [2] int32 b2,
  [3] int32 a2
)

IL_0000: ldarga.s _arg2
IL_0002: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0007: stloc.0
IL_0008: ldarga.s _arg2
IL_000a: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_000f: stloc.1
IL_0010: ldarga.s _arg1
IL_0012: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0017: stloc.2
IL_0018: ldarga.s _arg1
IL_001a: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_001f: stloc.3
IL_0020: nop
IL_0021: ldloc.1
IL_0022: ldloc.3
IL_0023: call int32 Program::min@8(int32, int32)
IL_0028: ldloc.0
IL_0029: ldloc.2
IL_002a: call int32 Program::min@8(int32, int32)
IL_002f: newobj instance void valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::.ctor(!0, !1)
IL_0034: ret

C# - MinPair

.maxstack 3
.locals init (
  [0] int32 b,
  [1] int32 b2,
  [2] int32 a2
)

IL_0000: ldarg.0
IL_0001: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0006: stloc.0
IL_0007: ldarg.0
IL_0008: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_000d: ldarg.1
IL_000e: ldfld !1 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item2
IL_0013: stloc.1
IL_0014: ldarg.1
IL_0015: ldfld !0 valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::Item1
IL_001a: stloc.2
IL_001b: ldloc.2
IL_001c: call int32 PerfItCs.Program::MinInt(int32, int32)
IL_0021: ldloc.0
IL_0022: ldloc.1
IL_0023: call int32 PerfItCs.Program::MinInt(int32, int32)
IL_0028: newobj instance void valuetype [System.ValueTuple]System.ValueTuple`2<int32, int32>::.ctor(!0, !1)
IL_002d: ret

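The difference here is that the C# compiler avoids introducing some local variables by leaving intermediate results on the evaluation stack. Since local variables are allocated on the stack anyway, it is hard to see why this should lead to more efficient code.

The other functions are very similar.

Disassembling the x86 code yields the following:

F# - the loop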

; F#
; struct (i, i) 
01690a7e 8bce            mov     ecx,esi
01690a80 8bd6            mov     edx,esi
; Loads x (pair) onto stack
01690a82 8d45f0          lea     eax,[ebp-10h]
01690a85 83ec08          sub     esp,8
01690a88 f30f7e00        movq    xmm0,mmword ptr [eax]
01690a8c 660fd60424      movq    mmword ptr [esp],xmm0
; Push new tuple on stack
01690a91 52              push    edx
01690a92 51              push    ecx
; Loads pointer to x into ecx (result will be written here)
01690a93 8d4df0          lea     ecx,[ebp-10h]
; Call min
01690a96 ff15744dfe00    call    dword ptr ds:[0FE4D74h]
; Increase i
01690a9c 46              inc     esi
01690a9d 81fe01e1f505    cmp     esi,offset FSharp_Core_ni+0x6be101 (05f5e101)
; Reached the end?
01690aa3 7cd9            jl      01690a7e

C# - the loop

; C#
; Loads x (pair) into ecx, eax
02c2057b 8d55ec          lea     edx,[ebp-14h]
02c2057e 8b0a            mov     ecx,dword ptr [edx]
02c20580 8b4204          mov     eax,dword ptr [edx+4]
; new System.ValueTuple<int, int>(i, i) 
02c20583 8bfe            mov     edi,esi
02c20585 8bd6            mov     edx,esi
; Push x on stack
02c20587 50              push    eax
02c20588 51              push    ecx
; Push new tuple on stack
02c20589 52              push    edx
02c2058a 57              push    edi
; Loads pointer to x into ecx (result will be written here)
02c2058b 8d4dec          lea     ecx,[ebp-14h]
; Call MinPair
02c2058e ff15104d2401    call    dword ptr ds:[1244D10h]
; Increase i
02c20594 46              inc     esi
; Reached the end?
02c20595 81fe00e1f505    cmp     esi,5F5E100h
02c2059b 7ede            jle     02c2057b

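It is hard to fathom why the F# code should perform significantly worse here. The code looks roughly equivalent, except for how x is loaded onto the stack. Until someone comes up with a good explanation, I am going to speculate that it is because movq has worse latency than push, and since all the instructions manipulate the stack, the CPU cannot reorder them to hide the latency of movq.

Why the JIT compiler chose movq for the F# code and not for the C# code, I currently don't know.

For x64, performance seems to worsen because of more overhead in the method prologues and more stalling due to aliasing. This is mainly speculation on my part, but it is hard to see from the assembly code what, other than stalling, could lower x64 performance by a factor of 4.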

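With min marked as inline, both x64 and x86 run in ~0.15 s. That is not surprising, since inlining eliminates all the method prologue overhead as well as a lot of reading from and writing to the stack.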
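A minimal sketch of what marking the functions inline might look like follows. The answer does not say which of the two `min` functions was marked, so marking both is an assumption here, and the names `minInt`, `minPair` and `runLoopInlined` are hypothetical.

```fsharp
// `inline` asks the F# compiler to expand the function body at each call site,
// so no call to a separate method remains in the loop below.
let inline minInt (a: int) (b: int) : int = if a < b then a else b
let inline minPair (struct (a1, b1)) (struct (a2, b2)) = struct (minInt a1 a2, minInt b1 b2)

// Same loop as in the question, parameterised by the iteration count.
let runLoopInlined (n: int) =
    let mutable x = struct (0, 0)
    for i in 1 .. n do
        x <- minPair x (struct (i, i))
    x
```

Because the expansion is done by the F# compiler rather than by the JIT, it does not depend on any attribute surviving into the compiled assembly.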

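Marking the F# methods for aggressive inlining (with [MethodImpl(MethodImplOptions.AggressiveInlining)]) does not work, because the F# compiler removes all such attributes, meaning the JIT compiler never sees them. Marking the C# methods for aggressive inlining, however, makes the C# code run in ~0.15 s.

So in the end, the x86 JIT compiler chose for some reason to compile the code differently, even though the IL looks very similar. Possibly the attributes on the methods affect the JIT compiler, as they differ slightly.

The x64 JIT compiler could probably do a better job of pushing the parameters onto the stack efficiently. I guess using push, as the x86 JIT compiler does, is preferable because the semantics of push are more restricted, but that is just speculation on my part.

In cases like this, where the methods are cheap, marking them as inline can be worthwhile.

To be honest, I am not sure this helps the OP, but hopefully it was somewhat interesting.

PS: I ran this code on .NET 4.6.2 on an i5 3570K.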