Really bizarre code generation in Visual C++: a 3x speed difference despite almost identical code

Question

c++ performance visual-c++ code-generation

3

The code below (reduced from a much larger piece of my own code, after I was surprised by its speed relative to std::vector) has two peculiar properties:

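- With a very tiny modification to the source (always compiled with /O2 in Visual C++ 2010), it runs more than three times faster than the original.
- It is about 20% faster with /MTd than with /MT, even though the generated loop looks identical!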

Here is how the assembly differs for the tiny modification:

Loop without the modification (~300 ms):

00403383  mov         esi,dword ptr [esp+10h] 
00403387  mov         edx,dword ptr [esp+0Ch] 
0040338B  mov         dword ptr [edx+esi*4],eax 
0040338E  add         dword ptr [esp+10h],ecx 
00403392  add         eax,ecx 
00403394  cmp         eax,4000000h 
00403399  jl          main+43h (403383h)

Loop with /MTd (looks identical! but ~270 ms):

00407D73  mov         esi,dword ptr [esp+10h] 
00407D77  mov         edx,dword ptr [esp+0Ch] 
00407D7B  mov         dword ptr [edx+esi*4],eax 
00407D7E  add         dword ptr [esp+10h],ecx 
00407D82  add         eax,ecx 
00407D84  cmp         eax,4000000h 
00407D89  jl          main+43h (407D73h)

Loop with the modification (~100 ms!!):

00403361  mov         dword ptr [esi+eax*4],eax 
00403364  inc         eax  
00403365  cmp         eax,4000000h 
0040336A  jl          main+21h (403361h)

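Now my question is: why do the changes above have the effects that they do? It is completely bizarre! Especially the first one: it should not affect anything at all (once you see the difference in the code), and yet it changes the running time dramatically. Is there an explanation for this?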

#include <cstdio>
#include <ctime>
#include <algorithm>
#include <memory>
template<class T, class Allocator = std::allocator<T> >
struct vector : Allocator
{
    T *p;
    size_t n;
    struct scoped
    {
        T *p_;
        size_t n_;
        Allocator &a_;
        ~scoped() { if (p_) { a_.deallocate(p_, n_); } }
        scoped(Allocator &a, size_t n) : a_(a), n_(n), p_(a.allocate(n, 0)) { }
        void swap(T *&p, size_t &n)
        {
            std::swap(p_, p);
            std::swap(n_, n);
        }
    };
    vector(size_t n) : n(0), p(0) { scoped(*this, n).swap(p, n); }
    void push_back(T const &value) { p[n++] = value; }
};
int main()
{
    int const COUNT = 1 << 26;
    vector<int> vect(COUNT);
    clock_t start = clock();
    for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
    printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
}

Hint:

It has to do with the allocator.

Answer:

Change Allocator &a_ to Allocator a_.

- user541686
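
As a sketch of that one-word change, here is a reduced copy of the question's container with the allocator held by value inside scoped. The names vector_by_value and fill_by_value are hypothetical, the constructor parameter is renamed to capacity so it no longer shadows the member n, and allocate(n) is used instead of allocate(n, 0) because the hinted overload was removed in C++20. Whether a given compiler then emits the short loop is not something this sketch guarantees.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical name: the question's container with the allocator stored by value.
template<class T, class Allocator = std::allocator<T> >
struct vector_by_value : Allocator
{
    T *p;
    std::size_t n;
    struct scoped
    {
        T *p_;
        std::size_t n_;
        Allocator a_;   // the change: the question had "Allocator &a_;"
        scoped(Allocator &a, std::size_t n) : p_(a.allocate(n)), n_(n), a_(a) { }
        ~scoped() { if (p_) { a_.deallocate(p_, n_); } }
        void swap(T *&p, std::size_t &n)
        {
            std::swap(p_, p);
            std::swap(n_, n);
        }
    };
    explicit vector_by_value(std::size_t capacity) : p(0), n(0)
    {
        // As in the question, the count is swapped with the local parameter,
        // so the member n stays 0 and only p takes over the new buffer.
        scoped(*this, capacity).swap(p, capacity);
    }
    // Like the question's code, this type has no destructor, so the buffer
    // is never released.
    void push_back(T const &value) { p[n++] = value; }
};

// Hypothetical helper: fills a container and returns how many elements it holds.
inline std::size_t fill_by_value(int count)
{
    vector_by_value<int> v(static_cast<std::size_t>(count));
    for (int i = 0; i < count; i++) { v.push_back(i); }
    return v.n;
}
```

Timing the loop in fill_by_value against the same loop on the original type is one way to check whether the difference reproduces on a particular compiler.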

3

What is the tiny modification? Is this a riddle or a real question? - CB Bailey

1

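Regarding the time difference between loops 1 and 2: did you take many measurements and average them? (As you probably know, run time can vary from run to run even for identical code.) - Man of One Way

@ManofOneWay: Yes, I did, and it is very consistent. - user541686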

5

Obviously the compiler is worrying about possible aliasing, but I am still not sure what your question actually is. - CB Bailey

2

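For some strange reason, if you remove the line std::swap(n_, n); you always get the fast loop. Also note that in the vector constructor you have two variables named n: the member and the parameter! That is rather confusing. - rodrigo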

3 Answers

1

It is strange that Allocator& breaks the alias chain while Allocator does not.

You could try

for(int i=vect.n; i<COUNT;++i){
    ...
}

to force i and n to stay in sync.

That will make it easier for VC to optimize.

- lenx.wei
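
The suggestion above can be sketched in a form that compiles on its own. int_buffer and fill_from_n are hypothetical names; int_buffer stands in for the question's container and keeps only the two fields that push_back touches, with storage supplied by the caller.

```cpp
#include <cstddef>

// Hypothetical stand-in for the question's container.
struct int_buffer
{
    int *p;
    std::size_t n;
    void push_back(int value) { p[n++] = value; }
};

// The loop counter starts at vect.n instead of 0, so i and vect.n
// advance in lockstep for the whole loop.
inline void fill_from_n(int_buffer &vect, int count)
{
    for (int i = static_cast<int>(vect.n); i < count; ++i) { vect.push_back(i); }
}
```

With an empty buffer this writes the same values as the original loop; if the buffer already holds elements, it continues from the current size rather than restarting at 0.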

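Lenx, I am not sure I understand what you mean. How does this explain the allocator issue? (Also, this is a bit random, but... do we know each other?!) - user541686

Hi Mehrdad, nice to see you here :) Oh, I see your question now. It is really strange behavior from VC2010. Allocator& prevents VC from recognizing that n stays in sync with i. Maybe you could try: for(int i=vect.n; i < COUNT; i++) ... and see whether it gets optimized properly. - lenx.wei

Haha, great, nice to see you here too! :) That is an amazing find: I just tried what you mentioned, and it cut the time from 270 ms to 180 ms! It is still not the ~90 ms I am seeing, but it is much closer. +1, I would never have thought of this, thanks for pointing it out! - user541686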

0

Hmm... it looks like this is the "fastest" code

00403361  mov         dword ptr [esi+eax*4],eax 
00403364  inc         eax  
00403365  cmp         eax,4000000h 
0040336A  jl          main+21h (403361h)

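This code is a bit over-optimized. In this loop, vect.n is ignored entirely... If an exception occurred inside the loop, vect.n would not be updated correctly.

So the answer may be: when you use Allocator, VC figures out that vect.n is never used again and can therefore ignore it. That is amazing, but usually not very useful, and also dangerous.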

- lenx.wei
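
To make that observation concrete, the sketch below contrasts a loop that updates n through the object on every iteration with a hand-hoisted version that keeps the pointer and the count in locals, which is roughly the shape of the four-instruction listing. int_buffer, fill_direct and fill_hoisted are hypothetical names. The single store to n after the loop is added here so that both functions leave the same state; the listing does not show whether the compiler emits one.

```cpp
#include <cstddef>

// Hypothetical stand-in for the question's container.
struct int_buffer
{
    int *p;
    std::size_t n;
};

// n is read and written through vect on every iteration, as a compiler
// must do when it cannot rule out that something else observes vect.n.
inline void fill_direct(int_buffer &vect, int count)
{
    for (int i = 0; i < count; ++i) { vect.p[vect.n++] = i; }
}

// The pointer and the count live in locals for the duration of the loop,
// and the count is written back once at the end.
inline void fill_hoisted(int_buffer &vect, int count)
{
    int *p = vect.p;
    std::size_t n = vect.n;
    for (int i = 0; i < count; ++i) { p[n++] = i; }
    vect.n = n;
}
```

Both functions produce the same buffer contents and the same final n; they differ only in how many loads and stores the loop body needs when nothing is optimized away.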

- Michael Burr · Accepted Answer

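As for the difference between /MT and /MTd, my guess is that the /MTd heap allocation fills the heap memory with a marker pattern for debugging purposes, which makes it more likely to be paged in, and this happens before you start timing.

If you "warm up" the vector's allocation, you get the same numbers for /MT and /MTd: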

vector<int> vect(COUNT);

// make sure vect's memory is warmed up
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
vect.n = 0; // clear the vector

clock_t start = clock();
for (int i = 0; i < COUNT; i++) { vect.push_back(i); }
printf("time: %d\n", (clock() - start) * 1000 / CLOCKS_PER_SEC);
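
The warm-up idea can also be tried in isolation. The sketch below is not the answer's code: it times two identical passes over a freshly allocated buffer, so that a first-touch cost, if there is one, shows up as the gap between them. pass_times, elapsed_ms and time_two_passes are hypothetical names. The writes go through a volatile pointer only so that an optimizer cannot drop the first pass, and the size of any gap depends on the OS, the runtime heap, and the buffer size.

```cpp
#include <cstddef>
#include <ctime>

// Hypothetical result type: time of each pass plus one value read back.
struct pass_times
{
    long first_ms;
    long second_ms;
    int last_value;
};

inline long elapsed_ms(std::clock_t start, std::clock_t stop)
{
    return static_cast<long>((stop - start) * 1000 / CLOCKS_PER_SEC);
}

// count must be greater than 0.
inline pass_times time_two_passes(std::size_t count)
{
    int *buffer = new int[count];   // deliberately left uninitialized
    volatile int *out = buffer;     // keeps every store observable
    pass_times result;

    std::clock_t start = std::clock();
    for (std::size_t i = 0; i < count; ++i) { out[i] = static_cast<int>(i); }
    result.first_ms = elapsed_ms(start, std::clock());

    start = std::clock();
    for (std::size_t i = 0; i < count; ++i) { out[i] = static_cast<int>(i); }
    result.second_ms = elapsed_ms(start, std::clock());

    result.last_value = out[count - 1];
    delete[] buffer;
    return result;
}
```

Calling time_two_passes(1 << 26) mirrors the question's COUNT. If the two timings come out close, first-touch cost is probably not what separates the builds on that machine.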