TParallel.For默认线程池的奇怪行为

Question

TParallel.For默认线程池的奇怪行为

delphiparallel-processingrtl-ppl

19

我正在尝试 Delphi XE7 Update 1 的并行编程功能。

我创建了一个简单的 TParallel.For 循环，基本上是进行一些虚假操作来消磨时间。

我在 AWS 实例（c4.8xlarge）上启动程序，使用36个虚拟CPU来测试并行编程可能带来的收益。

当我第一次启动程序并执行 TParallel.For 循环时，我看到了显着的收益（虽然承认比我预期的36个虚拟CPU要少得多）：

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

如果我不关闭程序并在36个vCPU机器上稍后再次运行该传递（例如，立即或10-20秒钟后），则并行传递会大大恶化：

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

如果我没有关闭程序，并且等几分钟（不是几秒钟，而是几分钟）再运行密码破解程序，我会再次得到第一次启动程序时的结果（响应时间提高了10倍）。

在36个虚拟CPU的计算机上，刚启动程序后的第一次运行总是比较快的，因此似乎只有在程序第二次调用TParallel.For函数时才会出现这种效果。

这是我正在运行的示例代码：

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

 begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr,120000); for i := 1 to 120000 do referenceStr[i] := Chr(Ord('a') + Random(26)); 

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

这是否是按设计工作的？我发现了这篇文章（http://delphiaball.co.uk/tag/parallel-programming/）建议我让库自己决定线程池，但如果我必须等待数分钟才能使请求更快地得到服务，使用并行编程的意义在哪里？

我对如何使用 TParallel.For 循环有什么遗漏吗？

请注意，我在 AWS m3.large 实例上无法重现此问题（根据 AWS，该实例有 2 个 vCPU）。在该实例上，我总是能够获得轻微的改进，并且在不久之后连续调用 TParallel.For 时也不会得到更差的结果。

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

看起来，这种情况发生在有许多可用核心（36个）时，这很遗憾，因为并行编程的整个重点是从许多核心中获益。我想知道这是否是库错误，因为核心计数高或在这种情况下核心计数不是2的幂。

更新：在AWS中使用不同vCPU计数的各种实例进行测试后，似乎出现了以下情况：

36 vCPUs（c4.8xlarge）。在普通TParallel调用之间必须等待几分钟（这使其在生产环境中无法使用）

32 vCPUs（c3.8xlarge）。在普通TParallel调用之间必须等待几分钟（这使其在生产环境中无法使用）

16 vCPUs（c3.4xlarge）。您必须等待次秒时间。如果负载低但响应时间仍然重要，则可以使用

8 vCPUs（c3.2xlarge）。它似乎正常工作

4 vCPUs（c3.xlarge）。它似乎正常工作

2 vCPUs（m3.large）。它似乎正常工作

- Pep

@Pep 如果你认为库是个问题，那就用另一个库写代码并进行比较。我怀疑库不是问题所在。 - David Heffernan

我进行了更多的测试。似乎在AWS上，当vCPUs>8时，Parallel库存在一些问题。当vCPU = 16时，它比vCPU = 32或36要好得多，但仍然存在问题。可能TParallel.For调用已经针对最多8个虚拟核心（桌面机器）的系统进行了优化。我将在问题中更新我的发现。 - Pep

这只是在AWS上使用问题中的代码进行黑盒测试。我刚刚发布了我在提到的AWS实例类型上使用Delphi XE7 Update 1获得的行为。 - Pep

1

所以，我认为该库不会针对某个核心数进行优化。但我认为该库很可能是问题的根源。与 OTL 的比较将使人们了解情况。但我必须说的是，新的 RTL 并行库真的很糟糕。这里已经有无数帖子揭示了它的实现非常糟糕。我怀疑我是否会使用它。我向您推荐OTL。 - David Heffernan

4

我认为很明显，在某一点上，你遇到了并行库中的一个bug，导致代码被串行执行。这肯定不是设计上的意图，也肯定不是调优错误造成的。它肯定是由于糟糕的实现所致。坦率地说，Embarcadero在编写正确的线程代码方面的记录非常糟糕。在“TMonitor”惨败之后，谁还能相信他们呢？ - David Heffernan

显示剩余6条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- David Heffernan · Accepted Answer

我创建了两个测试程序，基于您的程序，用于比较 System.Threading 和 OTL。我使用的是XE7更新版1和OTL r1397。我使用的OTL源代码对应于3.04版本。使用32位Windows编译器进行构建，并使用发布构建选项。

我的测试机是一台双路英特尔 Xeon E5530运行Windows 7 x64的机器。该系统有两颗四核处理器，总计8个处理器，但由于超线程技术，系统显示有16个处理器。经验告诉我，超线程只是一种市场营销手段，我从未在这台机器上看到超过8倍的性能提升。

接下来是这两个几乎相同的程序:

System.Threading

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

现在是输出结果。

System.Threading 输出

并行匹配：19230817，用时374ms
串行匹配：19230817，用时2423ms
并行匹配：19230698，用时374ms
串行匹配：19230698，用时2409ms
并行匹配：19230556，用时368ms
串行匹配：19230556，用时2433ms
并行匹配：19230635，用时2412ms
串行匹配：19230635，用时2430ms
并行匹配：19230843，用时2441ms
串行匹配：19230843，用时2413ms
并行匹配：19230905，用时2493ms
串行匹配：19230905，用时2423ms
并行匹配：19231032，用时2430ms
串行匹配：19231032，用时2443ms
并行匹配：19230669，用时2440ms
串行匹配：19230669，用时2473ms
并行匹配：19230811，用时2404ms
串行匹配：19230811，用时2432ms
....

OTL 输出

并行匹配：19230667，用时422ms
串行匹配：19230667，用时2475ms
并行匹配：19230663，用时335ms
串行匹配：19230663，用时2438ms
并行匹配：19230889，用时395ms
串行匹配：19230889，用时2461ms
并行匹配：19230874，用时391ms
串行匹配：19230874，用时2441ms
并行匹配：19230617，用时385ms
串行匹配：19230617，用时2524ms
并行匹配：19231021，用时368ms
串行匹配：19231021，用时2455ms
并行匹配：19230904，用时357ms
串行匹配：19230904，用时2537ms
并行匹配：19230568，用时373ms
串行匹配：19230568，用时2456ms
并行匹配：19230758，用时333ms
串行匹配：19230758，用时2710ms
并行匹配：19230580，用时371ms
串行匹配：19230580，用时2532ms
并行匹配：19230534，用时336ms
串行匹配：19230534，用时2436ms
并行匹配：19230879，用时368ms
串行匹配：19230879，用时2419ms
并行匹配：19230651，用时409ms
串行匹配：19230651，用时2598ms
并行匹配：19230461，用时357ms
....

我让 OTL 版本运行了很长时间，但模式从未改变。并行版本始终比串行版本快约7倍。

结论

代码非常简单，唯一合理的结论是 System.Threading 的实现存在缺陷。

新的 System.Threading 库已经出现了大量与 bug 相关的报告。所有迹象表明它的质量很差。Embarcadero 有发布次标准库代码的长期记录。我想到的是 TMonitor，XE3 字符串帮助程序，早期版本的 System.IOUtils，FireMonkey 等等。

很显然，Embarcadero 的质量存在很大问题。发布的代码明显没有经过充分的测试，如果有的话。对于一个线程库来说尤其麻烦，因为 bug 可以潜伏，并且只有在特定的硬件/软件配置中才会暴露出来。从 TMonitor 的经验来看，我认为 Embarcadero 没有足够的专业知识来生产高质量、正确的线程代码。

我的建议是，在目前的形式下不要使用 System.Threading。除非能够证明它具有足够的质量和正确性，否则应该被回避。我建议您使用 OTL。

编辑：程序的原始 OTL 版本存在实时内存泄漏问题，这是由于一些丑陋的实现细节导致的。Parallel.For 创建具有 .Unobserved 修饰符的任务。这会导致这些任务只有在某个内部消息窗口接收到“任务已终止”消息时才会被销毁。该窗口在与 Parallel.For 调用程序相同的线程中创建 - 即在本例中是主线程。由于主线程未处理消息，因此任务永远不会被销毁，内存消耗（以及其他资源）会不断积累。可能因此程序会在一段时间后挂起。