迭代性能：Java与C++比较

Question

迭代性能：Java与C++比较

9

这件事让我困惑了3天。我有一个应用程序需要评估一组非常少元素的整数多项式（多个参数）。我已经用Java编写了一个实现，现在正在移植到C++。

在测试过程中，我注意到C++版本比Java版本慢了几个数量级。当然，我知道JIT和这种编译器场景很匹配，但我看到的与我预期的差距太大了。

以下是示例代码，您需要使用boost来编译C++代码（但只有在进行简单的时间测量时才需要此依赖项）。

choeger@daishi ~/uebb % clang++ -O3 -std=c++11 polytest.cpp -lboost_timer -lboost_system
choeger@daishi ~/uebb % ./a.out                                                         
 0.011694s wall, 0.010000s user + 0.000000s system = 0.010000s CPU (85.5%)
Ideal Result: 1e+07
 0.421986s wall, 0.420000s user + 0.000000s system = 0.420000s CPU (99.5%)
Result: 1e+07
choeger@daishi ~/uebb % javac PolyTest.java                                             
choeger@daishi ~/uebb % java PolyTest                                                   
evals: 10000000 runtime: 17ms
Ideal Result: 1.0E7
evals: 10000000 runtime: 78ms
Result: 1.0E7

显然，使用clang-3.3编译的C++版本在纯计算能力方面运行略快，但是当多项式被解释时，Java(openjdk 1.7.0.60)表现更优。我猜测，这是因为我的C++代码由于迭代小向量（在示例中为1个元素）而不太优化。我假设JVM在缓存命中/未命中方面做得更好。

有没有办法让我的C++版本表现更好？还有其他原因我没看到吗？顺便说一句：有没有办法衡量C++和Java进程的缓存一致性？

C++代码如下：

#include <boost/timer/timer.hpp>

#include <iostream>
#include <vector>

using namespace std;

struct Product {
  int factor;
  vector<int> fields;
};


class SumOfProducts {
public:
  vector<Product> sum;

  /**
   * evaluate the polynomial with arguments separated by width
   */
  inline double eval(const double* arg, const int width) const {
    double res = 0.0;
    for (Product p : sum) {
      double prod = p.factor;
      for (int f : p.fields) {
    prod *= arg[f*width];
      }
      res += prod;
    }
    return res;
  };
};


double idealBenchmark(const double* arg, const int width) {
  boost::timer::auto_cpu_timer t;
  double res = 0.0;
  // run 10M evaluations 
  for (long l = 0; l < 10000000; l++) {
    res = res + arg[width] * arg[width];
  }
  return res;
}

double benchmark(const double* arg, const SumOfProducts& poly) {
  boost::timer::auto_cpu_timer t;
  double res = 0.0;
  // run 10M evaluations 
  for (long l = 0; l < 10000000; l++) {
    res = res + poly.eval(arg, 1);
  }
  return res;
}

int main() {
  //simple polynomial: x_1^2
  Product p;
  p.factor = 1;
  p.fields.push_back(1);
  p.fields.push_back(1);

  SumOfProducts poly;
  poly.sum.push_back(p);

  double arg[] = { 0, 1 };

  double res = idealBenchmark(arg, 1);
  cout << "Ideal Result: " << res << endl;

  res = benchmark(arg, poly);
  cout << "Result: " << res << endl;
}

Java版本如下：

public class PolyTest {

    static class Product {
    public final int factor;
    public final int[] fields;

    public Product(int pFactor, int[] pFields) {
        factor = pFactor;
        fields = pFields;
    }
    }

    static class SumOfProducts {
    final Product[] sum;

    public SumOfProducts(Product[] pSum) {
        sum = pSum;
    }

    /**
     * evaluate the polynomial with arguments separated by width
     */
    double eval(final double[] arg, final int width) {
        double res = 0.0;
        for (Product p : sum) {
        double prod = p.factor;
        for (int f : p.fields) {
            prod *= arg[f*width];
        }
        res += prod;
        }
        return res;
    }
    }

    static double idealBenchmark(final double[] arg, final int width) {
    final long start = System.currentTimeMillis();
    double res = 0.0;
    long evals = 0;
    // run 10M evaluations 
    for (long l = 0; l < 10000000; l++) {
        evals++;
        res = res + arg[width] * arg[width];
    }
    System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
    return res;
    }

    static double benchmark(final double[] arg, final SumOfProducts poly) {
    final long start = System.currentTimeMillis();
    double res = 0.0;
    long evals = 0;
    // run 10M evaluations 
    for (long l = 0; l < 10000000; l++) {
        evals++;
        res = res + poly.eval(arg, 1);
    }
    System.out.println("evals: " + evals + " runtime: " + (System.currentTimeMillis() - start) + "ms");
    return res;
    }

    public static void main(String[] args) {
    //simple polynomial: x_1^2
    Product p = new Product(1, new int[]{1, 1});

    SumOfProducts poly = new SumOfProducts(new Product[]{p});

    double arg[] = { 0, 1 };

    double res = idealBenchmark(arg, 1);
    System.out.println("Ideal Result: " + res);

    res = benchmark(arg, poly);
    System.out.println("Result: " + res);
    }

}

- choeger

4

你的Java代码使用数组，而C++代码使用向量。这不是一个公平的比较。 - mah

数量级 - 那至少比...慢100倍？ - suspectus

1

现在，下次有C语言程序员告诉你其他编程语言太慢时，你可以通过第一手经验告诉他：重要的不是编程语言的性能，而是语言如何轻松地编写高效程序 ;) - Cephalopod

2

@Arian 不管是来自C程序员还是其他类型的程序员，任何没有证据支持的陈述都不应该被相信。 - juanchopanza

我主要使用向量，因为多项式是预先计算的。虽然可以将它们转换为普通数组，但实际上原因在于循环（请参见下面的答案）。 @Zac：是的，这就是我使用它的方式。fields中的每个int都是相应产品中的一个变量。 - choeger

显示剩余2条评论

3个回答

8

首先，您应该更改此行

for (Product p : sum)

成为

for (Product const& p: sum)

每次迭代都会分配、复制和释放一个带有其包含的 std::vector<int> 的新的 Product。我没有看到其他的情况，但由于它靠近内层循环，我预计会有很大的影响。

- Dietmar Kühl

2

根据你的回答，看起来你正在使用以下结构：

struct Product 
{
    int factor;
    vector<int> fields;
};

以极其低效的方式存储。也就是说，多项式4 x ^ 3将被存储为

Product p { 4, {1, 1, 1} };

这种方法在处理能力和内存方面都非常低效。相反，如果您将多项式的特定项存储在预定向量中：

vector<int> Polynomial { 1, 4, 3, 5 }; // 5x^3 + 3x^2 + 4x + 1

如果多项式的次数由指数决定。那么，计算多项式的函数就是：

int evaluate(int x, const std::vector<int>& polynomial)
{
    int result = 0;
    for (std::size_t i = 0; i < polynomial.size(); ++i)
    {
        //        coefficient     x to the correct power
        result += polynomial[i] * std::pow(x, i);
    }
    return result;
}

作为一个附注：相同的优化可以应用于你的Java代码。

如果由于某种原因您不想使用std:pow，那么你可以很容易地自己实现：

int pow(int x, unsigned int p)
{
    int result = 1;
    for (unsigned int i = 0; i < p; ++i)
    {
        result *= x;
    }
    return result;
}

如果您关心的是稀疏多项式：

struct SubPolynomial
{
    int Coefficient;
    unsigned int Degree;
};

std::vector<SubPolynomial> polynomial;

int evaluate(int x, const std::vector<int>& polynomial)
{
    int result = 0;
    std::for_each(polynomial.begin(), polynomial.end(), [&](const SubPolynomial& s)
    {
        //        coefficient     x to the correct power
        result += s.Coefficient * pow(x, s.Degree);
    });
    return result;
}

请注意，如果您有一个完整的多项式，您将使用比第一个示例所需的内存多两倍的内存。但是，如果您有一个稀疏多项式（例如，次数为N且少于N / 2个系数为非零），则最多只会使用相同数量的内存。

- Zac Howland

抱歉，但这并不适用。首先，我正在讨论整数多项式，因此std::pow可能是浪费时间。其次（更糟的是），我有很多非常稀疏的多个参数。因此，我要么必须使用大量的^0，要么跳转到给定向量。话虽如此，我有一个Horner-Scheme实现（使用整数pow()）搁置在那里，现在可以简单地测试它。 - choeger

即使您不喜欢std::pow（出于任何原因），自己实现它也很简单。使用向量来保存多项式的次数是一种严重浪费空间的做法 - 正如我在这里展示的那样，您可以使用索引作为次数，因此每个系数只需要1个整数（而不是M（1 + Nk）个整数，其中M是非零系数的数量，Nk是第k个元素的次数）。如果稀疏项有问题，您可以将其修改为每个项保存2个整数：系数和次数，这仍然比您拥有的效率要高得多。 - Zac Howland

正如我所说：我将测试一个高效的整数幂实现，但不要期望有很大的好处，因为我的次数相当低。关于空间：任何多项式都可以有多个参数。我可以在度数旁边存储索引，但仍需要一个向量。这只是我的例子只有一个参数。 - choeger

@choeger：“任何多项式都可以有多个参数。” 如果您的意思是有多个不确定因子，那么您的示例与您的目标不符。{N {1, 1}}表示的是Nxy，而不是x_1^2（无论那是什么）。 - Zac Howland

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- juanchopanza · Accepted Answer

15

您正在这里制作昂贵的副本：

for (Product p : sum)

每次复制都意味着完全复制每个元素的std::vector<int>数据成员。改用引用代替：

for (const Product& p : sum)

请注意我将它们设为const，因为您不需要更改范围内的元素。

- juanchopanza

并且使用一个数组，而不是一个向量。 - Daniel Nuriyev

只有在分析表明使用向量会导致明显的性能损失时，才应该使用@DanielNuriyev。但是首先可以尝试其他方法，例如将基于范围的循环更改为基于索引的循环。 - juanchopanza

有人写了这样一篇文章：http://lemire.me/blog/archives/2012/06/20/do-not-waste-time-with-stl-vectors/ - Daniel Nuriyev

3

添加移动语义和正确使用vector的构造函数，可以让该博客文章中提到的3个参数不再成立。正如以往，C++不能保证优化你编写不良的代码（我不知道有任何一种语言可以确保这一点）。 - Zac Howland

2

太棒了，非常感谢。其实我也知道这个 C++ 语法细节，只是没在循环中注意到它。经过三天的烦恼，最终在准备问题时花费了1小时+10分钟等待 stackoverflow 的帮助才解决了问题。但这件事最好不要让我的雇主知道 ;)。 - choeger

显示剩余4条评论