内置模数运算符（'%'）与自定义模数函数：如何提高模数运算的性能

Question

内置模数运算符（'%'）与自定义模数函数：如何提高模数运算的性能

8

最近，我了解到mod（'％'）运算符非常缓慢。因此，我创造了一个函数，它将像a％b一样工作。但它是否比mod运算符更快呢？

这是我的函数

int mod(int a, int b)
{
    int tmp = a/b;
    return a - (b*tmp);
}

- madMDT

12

请自行对比，但是你无法做得比编译器更好。 - President James K. Polk

1

我了解到取模（'%'）运算符非常慢 - 真的吗？您如何衡量它，与什么进行比较？ - Martin James

2

顺便说一句，有些处理器具有可以在一条指令中返回除法的商和余数的指令。 - Thomas Matthews

1

这是在许多处理器上调用一个“div”的操作。 - Martin James

我与其他数学运算符（如+/-）进行了比较。这个运算符不能像模运算符那样完成所有任务，但速度非常快。@ Martin James - madMDT

显示剩余6条评论

5个回答

10

这可能会受到编译器和平台的影响。

但是我对此很感兴趣，在我的系统上，根据我的基准测试，您似乎是正确的。但是@865719的答案中的方法最快：

#include <chrono>
#include <iostream>

class Timer
{
    using clk = std::chrono::steady_clock;
    using microseconds = std::chrono::microseconds;

    clk::time_point tsb;
    clk::time_point tse;

public:

    void clear() { tsb = tse = clk::now(); }
    void start() { tsb = clk::now(); }
    void stop() { tse = clk::now(); }

    friend std::ostream& operator<<(std::ostream& o, const Timer& timer)
    {
        return o << timer.secs();
    }

    // return time difference in seconds
    double secs() const
    {
        if(tse <= tsb)
            return 0.0;
        auto d = std::chrono::duration_cast<microseconds>(tse - tsb);
        return d.count() / 1000000.0;
    }
};

int mod(int a, int b)
{
    int tmp=a/b;
    return a-(b*tmp);
}

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input < ceil ? input : input % ceil;
    // NB: the assumption here is that the numbers are positive
}

int main()
{
    auto N = 1000000000U;
    unsigned sum = 0;

    Timer timer;

    for(auto times = 0U; times < 3; ++times)
    {
        std::cout << "     run: " << (times + 1) << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += n % (N - n);
        timer.stop();

        std::cout << "       %: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += mod(n, N - n);
        timer.stop();

        std::cout << "     mod: " << sum << " " << timer << "s" << '\n';

        sum = 0;
        timer.start();
        for(decltype(N) n = 0; n < N; ++n)
            sum += fast_mod(n, N - n);
        timer.stop();

        std::cout << "fast_mod: " << sum << " " << timer << "s" << '\n';
    }
}

构建: GCC 5.1.1 (x86_64)

g++ -std=c++14 -march=native -O3 -g0 ...

输出：

     run: 1
       %: 3081207628 5.49396s
     mod: 3081207628 4.30814s
fast_mod: 3081207628 2.51296s
     run: 2
       %: 3081207628 5.5522s
     mod: 3081207628 4.25427s
fast_mod: 3081207628 2.52364s
     run: 3
       %: 3081207628 5.4947s
     mod: 3081207628 4.29646s
fast_mod: 3081207628 2.56916s

- Galik

你使用了什么编译器标志？ - edmz

@black 我在答案中加入了构建信息。 - Galik

@Galik 我的编译器无法编译你的代码。不过，你能否使用这个新的模数函数并告诉我时间呢？ int mod(int a, int b) { //int tmp=a/b; return a>b?a-(b*(a/b)):a; } - madMDT

@madMDT 你需要打开C++11来编译代码。在我的测试中，你的新函数与fast_mod差不多。 - Galik

这个性能测试可能会返回非常不精确的结果。我不会信任它们来得出关于计算在实际世界示例中表现如何的结论。 - Nikolai Ruhe

我尝试了取模运算符和您的快速取模函数来分解1万亿。取模运算符平均需要1.9秒。而快速取模函数平均需要3.7秒。 - user15114051

3

在某些情况下，程序员可以通过知道操作数的信息，超越编译器在余数运算中的表现。例如，如果基数可能是2的幂次方，但不太可能比要减小的值大，那么可以使用类似以下的方法：

unsigned mod(unsigned int x, unsigned int y)
{
  return y & (y-1) ? x % y : x & (y-1);
}

如果编译器将函数内联展开并且基数为2的常数幂，则编译器会使用按位AND替换余数运算符，这可能是一个重大改进。在基数不是2的常数幂的情况下，生成的代码需要进行一点计算，然后选择是否使用余数运算符，但在基数恰好是2的幂的情况下，按位AND的成本节省可能超过条件逻辑的成本。

另一个需要自定义模数函数的情况是当基数是固定常数，并且编译器没有提供计算余数的方法。例如，在可以执行快速整数移位和乘法的平台上计算x % 65521时，可以观察到计算 x -= (x>>16)*65521; 将使x变得更小，但不会影响x%65521的值。再做一次运算将把x减少到0..65745的范围内-足够小，以至于单个条件减法将产生正确的余数。

一些编译器可能知道如何使用这种技术有效地处理具有常数基数的%运算符，但对于那些不知道的编译器来说，这种方法可以是一种有用的优化，特别是在处理大于机器字长的数字时[请注意，65521是65536-15，在16位机器上，可以将x评估为x = (x & 65535) + 15*(x>>16)。不像减去65521*(x>>16)的形式那么可读，但可以轻松看出它如何在16位机器上高效处理。

- supercat

返回类型 unsigned？我和编译器以前从未见过。 - Akib Azmain Turja

1

@AkibAzmain：什么？它相当于“unsigned int”，而且我看到它比更长的形式使用得更多。 - supercat

1

我想在这个讨论中为大家贡献一点。如果你想处理负数，可以使用以下函数：

inline long long mod(const long long x, const long long y) {
    if (x >= y) {
        return x % y;
    } else if (x < 0) {
        return (x % y + y) % y;
    } else {
        return x;
    }
}

- ftgo

注意不要使用负数（模p）。在许多编程语言中，((-2)%5)!=(3%5)。因此，你可以为两个字符串计算相同的哈希值，但是当你比较它们时，它们看起来是不同的。为了避免这个问题，在代码中可以使用以下结构：x ← ((a%p)+p)%p，而不仅仅是x ← a%p。 - ftgo

0

大多数情况下，您的微优化代码不会比编译器更快。我也不知道那些声称内置%很慢的“智慧”来自哪里。它的速度与机器能够计算的速度一样快 - 借助编译器为您执行的所有微优化。

还要注意，对于这种非常小的代码片段的性能测量并不是一项容易的任务。循环结构的条件语句或时间测量的抖动可能会影响您的结果。您可以在像Andrei Alexantrescu或Chandler Caruth这样的人在YouTube上找到有关此类问题的讲座。我曾经为我正在工作的项目编写过一个微基准测试框架。确实有很多需要关注的事情，包括外部因素，例如操作系统抢占您的线程或将其移动到另一个核心。

- cdonat

2

你有没有尝试运行它？我在我的电脑上运行了几次，然后才问这个问题。我不知道为什么 mod(%) 很慢，但是减法(-)和除法(/)的组合会稍微快一点。 - madMDT

@madMDT 嗯，那么你的编译器供应商应该使用你的实现而不是天真的实现，甚至是更好的Chandler的实现。我期望下一个版本的clang这样做，因为这正是Chandler正在努力的方向。 - cdonat

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- maddouri · Accepted Answer

根据Chandler Carruth在CppCon 2015上的基准测试，最快的模运算符（在x86上，使用Clang编译时）是：

int fast_mod(const int input, const int ceil) {
    // apply the modulo operator only when needed
    // (i.e. when the input is greater than the ceiling)
    return input >= ceil ? input % ceil : input;
    // NB: the assumption here is that the numbers are positive
}

我建议您观看整个演讲，他会更详细地解释为什么这种方法比仅无条件使用%更快。