Why does Parallel.For cause huge context-switching overhead in this case?

Question

c# .net parallel-processing

3

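I'm experimenting with the new parallel tools in .NET 4 by computing Pi with the Monte Carlo method.

(The actual algorithm isn't very important, but for clarity, here it is:

Pick numIterations random points inside the unit square.
Count how many of those points lie inside the circle inscribed in that square (i.e. points less than 0.5 from the centre of the square).
Then, for very large numIterations, PI = 4 * iterationsInsideCircle / numIterations.)

I have a method int ThrowDarts(int numDarts) that picks numDarts random points inside the unit square (as described above) and returns how many of them fall inside the circle: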
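For reference, the estimator comes from an area ratio. A short derivation, using only the quantities named above (a square of side 1 and an inscribed circle of radius 0.5):

```latex
\[
P(\text{point inside circle})
  = \frac{\text{area of circle}}{\text{area of square}}
  = \frac{\pi \cdot 0.5^{2}}{1^{2}}
  = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{iterationsInsideCircle}}{\text{numIterations}}
\]
```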

    protected static int ThrowDarts(int iterations)
    {
        int dartsInsideCircle = 0;
        Random random = new Random();
        for (int iteration = 0; iteration < iterations; iteration++)
        {
            double pointX = random.NextDouble() - 0.5;
            double pointY = random.NextDouble() - 0.5;

            double distanceFromOrigin = Math.Sqrt(pointX*pointX + pointY*pointY);
            bool pointInsideCircle = distanceFromOrigin <= 0.5;

            if (pointInsideCircle)
            {
                dartsInsideCircle++;
            }
        }
        return dartsInsideCircle;
    }

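Basically, each of my implementations uses a different parallel mechanism, so I'm writing different ways of throwing the darts and counting how many land inside the circle.

For example, my single-threaded implementation is simply: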

    protected override int CountInterationsInsideCircle()
    {
        return ThrowDarts(_numInterations);
    }

I also have this method for one of my parallel algorithms:

    protected override int CountInterationsInsideCircle()
    {
        Task<int>[] tasks = new Task<int>[_numThreads];

        for (int i = 0; i < _numThreads; i++)
        {
            tasks[i] = Task.Factory.StartNew(() => ThrowDarts(_numInterations/_numThreads));
        }

        int iterationsInsideCircle = 0;
        for (int i = 0; i < _numThreads; i++)
        {
            iterationsInsideCircle += tasks[i].Result;
        }

        return iterationsInsideCircle;
    }

Hopefully you get the picture.

Here is where I've hit a snag. The Parallel.For version I'm writing causes a huge amount of context switching. Here is the code:

    protected override int CountInterationsInsideCircle()
    {
        ConcurrentBag<int> results = new ConcurrentBag<int>();
        int result = 0;

        Parallel.For(0, _numInterations,
                     // initialise each thread by setting its hit count to 0
                     () => 0,
                     //in the body, we throw one dart and see whether it hit or not
                     (iteration, state, localState) => localState + ThrowDarts(1),
                     // finally, we sum (in a thread-safe way) all the hit counts of each thread together
                     results.Add);

        foreach(var threadresult in results)
        {
            result+=threadresult;
        }

        return result;
    }

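The Parallel.For version does work, but it is very slow because of the context switching mentioned above (which does not occur in the first two approaches).

Can anyone tell me why this happens?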

- AlexC

How many processors does the machine have? - Naraen

ThrowDarts(1)? Raise that number so you don't burn a thread on executing a few nanoseconds of code. - Hans Passant

4 Answers

0

Update: I couldn't resist running the same benchmark on my home PC (32-bit Linux, Q9550) with Mono 2.8.2, just for fun:

[mono] /tmp @ dmcs MonteCarlo.cs 
[mono] /tmp @ time mono ./MonteCarlo.exe 
Yo
Approx: 392711899/500000000 => Pi: 3.141695192

real    0m28.109s
user    0m27.966s
sys 0m0.152s
[mono] /tmp @ dmcs MonteCarlo.cs # #define PARALLEL added
[mono] /tmp @ time mono ./MonteCarlo.exe 
Yo
Approx: 392687018/500000000 => Pi: 3.141496144

real    0m8.139s
user    0m31.506s
sys 0m0.064s

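So yes, it appears to scale as expected: roughly 28.1 s sequential versus 8.1 s parallel on this quad-core machine, a speedup of about 3.5x. Thanks for giving me a reason to actually put Mono to "use". That had been on my to-do list for a long time, and it runs really well!

Original post

I just timed this on a dual-core (E5300) Windows XP machine using Mono 2.8.2.

With the parallel version (#define PARALLEL), it ran in 40 seconds.

With the sequential version (PARALLEL not defined), it took about 45 seconds.

So I don't see the extra overhead you measured, or at least I don't see a slowdown. I don't see much of a speedup either.

In the parallel run, both CPUs reached 100% utilization, whereas the single-threaded version averaged about 50% across the two CPUs.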

#define PARALLEL
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Collections.Concurrent;
using System.Threading.Tasks;
namespace test
{
    class MainClass
    {
        const int _numInterations = 50000;
        const int _dartsPerIter = 10000;

        protected static int ThrowDarts (int iterations)
        {
            Random random = new Random ();
            int dartsInsideCircle = 0;
            for (int iteration = 0; iteration < iterations; iteration++) {
                double pointX = random.NextDouble () - 0.5;
                double pointY = random.NextDouble () - 0.5;

                double distanceFromOrigin = Math.Sqrt (pointX * pointX + pointY * pointY);
                bool pointInsideCircle = distanceFromOrigin <= 0.5;

                if (pointInsideCircle) {
                    dartsInsideCircle++;
                }
            }
            return dartsInsideCircle;
        }
        protected int CountInterationsInsideCircle ()
        {
            ConcurrentBag<int> results = new ConcurrentBag<int> ();
            int result = 0;

            // initialise each thread by setting its hit count to 0
            //in the body, we throw one dart and see whether it hit or not
            // finally, we sum (in a thread-safe way) all the hit counts of each thread together
#if PARALLEL
            Parallel.For (0, _numInterations, () => 0, (iteration, state, localState) => localState + ThrowDarts (_dartsPerIter), results.Add);
#else
            for (var i =0; i<_numInterations; ++i)
                results.Add(ThrowDarts (_dartsPerIter));
#endif

            foreach (var threadresult in results) {
                result += threadresult;
            }

            return result;
        }
        public static void Main (string[] args)
        {
            Console.WriteLine("Yo");
            var inside = new MainClass ().CountInterationsInsideCircle ();
            Console.WriteLine("Approx: {0}/{1} => Pi: {2}",
                               inside, _numInterations * _dartsPerIter,
                               (4.0*inside)/(1.0*_numInterations*_dartsPerIter));
        }
    }
}

- sehe

0

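At a guess: unlike the other implementations, this parallel loop uses a shared result set instead of tracking results locally and merging them at the end. That shared collection has to be thread-safe, which comes at a higher cost, not to mention that it may suffer from cache-line contention (http://msdn.microsoft.com/en-us/magazine/cc872851.aspx).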

- Mark Sowul

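My understanding of the way I wrote it is that each thread throws some darts and adds the result to its local state. Then, once the framework has finished with a thread, that thread's result is added to the concurrent bag. Finally, once Parallel.For terminates, all the results are summed on the main thread. As far as I can see, apart from the unavoidable concurrent bag, there is no shared result here, so it should not suffer much from cache-line contention. - AlexC

Ah, yes, I mixed up the overloads. - Mark Sowul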

1

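By the way, ThrowDarts(1) will have a lot of extra overhead, and the results will also be incorrect, because a new Random is created on every call. How large the overhead is may depend on whether Random uses global state. The incorrectness arises because Random instances created within the same time quantum (a millisecond?) produce the same sequence. - Mark Sowul

I had just posted my own answer when I saw this. That is exactly the problem; see my new implementation below. Thanks, Alex - AlexC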
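To illustrate the seeding pitfall described in the comment above, here is a minimal sketch. The `PerThreadRandom` class and its members are hypothetical names, not part of the thread's code. It assumes the .NET Framework behaviour in which the parameterless `Random` constructor seeds from the system tick count, so instances created in the same tick share a seed. The sketch sidesteps that by handing each thread an explicitly distinct seed.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not from the original thread.
public static class PerThreadRandom
{
    // Start from the tick count, then hand out consecutive seeds
    // so that no two threads share one.
    private static int _seedCounter = Environment.TickCount;

    // Each thread lazily creates its own Random; Random itself is not thread-safe.
    private static readonly ThreadLocal<Random> _random =
        new ThreadLocal<Random>(
            () => new Random(Interlocked.Increment(ref _seedCounter)));

    public static Random Instance
    {
        get { return _random.Value; }
    }

    // Demonstrates the pitfall: two generators built from the same seed
    // produce exactly the same sequence.
    public static bool SameSeedGivesSameSequence(int seed)
    {
        var first = new Random(seed);
        var second = new Random(seed);
        for (int i = 0; i < 100; i++)
        {
            if (first.Next() != second.Next())
            {
                return false;
            }
        }
        return true;
    }
}
```

With a helper like this, a loop body could call `PerThreadRandom.Instance.NextDouble()` instead of constructing a new `Random` for each dart.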

0

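What happens to your manual tasks (`Task`) when `_numThreads == _numIterations`? The first approach splits the work into `_numThreads` parts, whereas the `Parallel.For` version always creates `_numIterations` tasks, each containing a single iteration. Depending on the number of iterations, this can overwhelm the thread pool and cancel out any benefit of parallelism, because of the overhead of contending for the thread pool and its associated locks.

`Parallel.For` works well when each operation is relatively expensive and can be computed independently. Here, though, a single iteration is a cheap operation, so the overhead starts to dominate the time spent per task. You can make the `Parallel.For` version equivalent to your manual task version by looping over `_numThreads` and throwing `_numIterations / _numThreads` darts per iteration, just as you did in the manual task version.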
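The chunking idea in the paragraph above could look roughly like this. It is a sketch rather than code from this answer: `ChunkedMonteCarlo`, `CountInsideCircle`, and the per-chunk seeding scheme are assumptions made for illustration, and `numChunks` is expected to be positive.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of looping over chunks instead of single darts.
public static class ChunkedMonteCarlo
{
    // Same dart-throwing logic as in the question, but the caller supplies the Random.
    private static int ThrowDarts(int darts, Random random)
    {
        int inside = 0;
        for (int i = 0; i < darts; i++)
        {
            double x = random.NextDouble() - 0.5;
            double y = random.NextDouble() - 0.5;
            if (Math.Sqrt(x * x + y * y) <= 0.5)
            {
                inside++;
            }
        }
        return inside;
    }

    public static int CountInsideCircle(int numIterations, int numChunks)
    {
        int total = 0;
        int dartsPerChunk = numIterations / numChunks;
        int remainder = numIterations % numChunks;
        int baseSeed = Environment.TickCount;

        // One loop iteration per chunk, so each iteration does substantial work.
        Parallel.For(0, numChunks, chunk =>
        {
            // Assumption: a distinct seed per chunk avoids identical sequences.
            var random = new Random(unchecked(baseSeed + chunk));

            // Spread any leftover darts over the first few chunks.
            int darts = dartsPerChunk + (chunk < remainder ? 1 : 0);
            int hits = ThrowDarts(darts, random);

            // A single atomic add per chunk keeps shared-state traffic low.
            Interlocked.Add(ref total, hits);
        });

        return total;
    }
}
```

Calling `ChunkedMonteCarlo.CountInsideCircle(numIterations, Environment.ProcessorCount)` would give one chunk per core, which mirrors the manual `Task` version in the question.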

- Dan Bryant

- AlexC · Accepted Answer

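I have found the fix for this problem.

Previously, my ThrowDarts method created a new Random on every call (because the Random class is not thread-safe).

It turns out, however, that this is relatively expensive, at least when each call throws only a single dart, so that a new Random is created for every iteration.

So I changed my ThrowDarts method to accept a Random created by the caller, and changed my loop state so that it holds its own Random.

As a result, each thread in the Parallel.For has its own Random. My new implementation is as follows: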

    protected override int CountInterationsInsideCircle()
    {
        ConcurrentBag<int> results = new ConcurrentBag<int>();
        Parallel.For(0, _numInterations,
                     // initialise each thread with its own state (a hit count of 0 and its own Random)
                     () => new LoopThreadState(),
                     // in the body, we throw one dart and see whether it hit or not
                     (iteration, _, localState) =>
                        {
                            localState.Count += ThrowDarts(1, localState.RandomNumberGenerator);
                            return localState;
                        },
                     // finally, we sum (in a thread-safe way) all the hit counts of each thread together
                     result => results.Add(result.Count));

        int finalResult = 0;
        foreach (int threadresult in results)
        {
            finalResult += threadresult;
        }

        return finalResult;
    }
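The snippet above relies on a `LoopThreadState` class and a two-argument `ThrowDarts` overload that the answer does not show. The following is a hypothetical reconstruction, so the member layout, the seeding scheme, and the `DartBoard` wrapper are all assumptions. It also swaps the `ConcurrentBag<int>` for `Interlocked.Add`, which is a different way of merging the per-thread counts, not what the answer itself does.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reconstruction: the answer does not show this class.
public class LoopThreadState
{
    private static int _nextSeed = Environment.TickCount;

    public int Count;
    public Random RandomNumberGenerator { get; private set; }

    public LoopThreadState()
    {
        // Assumption: explicit distinct seeds, so states created in the
        // same tick do not end up with identical sequences.
        RandomNumberGenerator = new Random(Interlocked.Increment(ref _nextSeed));
    }
}

public static class DartBoard
{
    // ThrowDarts overload that reuses a caller-supplied Random.
    public static int ThrowDarts(int iterations, Random random)
    {
        int dartsInsideCircle = 0;
        for (int iteration = 0; iteration < iterations; iteration++)
        {
            double pointX = random.NextDouble() - 0.5;
            double pointY = random.NextDouble() - 0.5;
            if (Math.Sqrt(pointX * pointX + pointY * pointY) <= 0.5)
            {
                dartsInsideCircle++;
            }
        }
        return dartsInsideCircle;
    }

    public static int CountInsideCircle(int numIterations)
    {
        int total = 0;
        Parallel.For(0, numIterations,
            // localInit: runs once per worker task, not once per iteration
            () => new LoopThreadState(),
            // body: throw one dart using the worker's own Random
            (iteration, loopState, localState) =>
            {
                localState.Count += ThrowDarts(1, localState.RandomNumberGenerator);
                return localState;
            },
            // localFinally: merge each worker's count with one atomic add
            localState => Interlocked.Add(ref total, localState.Count));
        return total;
    }
}
```

The point of the design is that the cost of constructing a `Random` is paid once per worker rather than once per dart, which matches the fix described in the answer.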

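I guess the context-switch metric was a bit misleading; a simple profiling run would have pinpointed the problem. Nice curveball, .NET, nice. Anyway, lesson learned!

Thanks everyone, Alex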