Self Numbers in C++

Question

18

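Hey, my friends and I are trying to beat each other's runtimes for generating "self numbers" between 1 and a million. I've written mine in C++ and I'm still trying to shave off precious time.

Here's what I have so far: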

#include <iostream>

using namespace std;

bool v[1000000];
int main(void) {
  long non_self = 0;
  for(long i = 1; i < 1000000; ++i) {
    if(!(v[i])) std::cout << i << '\n';
    non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
    v[non_self] = 1;
  }
  std::cout << "1000000" << '\n';
  return 0;
}

The code runs fine now; I just want to optimize it. Any suggestions? Thanks.

- Anon

8

Define "self number". - Anon.

8

I take it you're losing, then? - David M

3

http://en.wikipedia.org/wiki/Self_number - Mike Gleason jr Couturier

1

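Self number: a self number, Colombian number or Devlali number is an integer that, in a given base, cannot be generated by adding any other integer to the sum of that integer's own digits. (From Wikipedia) - James Wiseman

1

Array overflow. When i >= 999955, non_self >= 1000000, so you access beyond the bounds of v. - Josh Kelley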
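One way to avoid the out-of-bounds write described in the comment above is to check the index before marking it. The following is a minimal sketch, not the asker's code: the names `digit_sum` and `self_numbers_up_to` are hypothetical, and it collects the results in a vector instead of printing them.

```cpp
#include <vector>

// Hypothetical helper: sum of the decimal digits of n.
unsigned digit_sum(unsigned n) {
    unsigned s = 0;
    for (; n > 0; n /= 10) s += n % 10;
    return s;
}

// Hypothetical helper: returns all self numbers in [1, limit].
// Same sieve idea as the question's code, with a bounds check on every write.
std::vector<unsigned> self_numbers_up_to(unsigned limit) {
    std::vector<char> generated(limit + 1, 0);
    std::vector<unsigned> result;
    for (unsigned i = 1; i <= limit; ++i) {
        // Every generator of i is smaller than i, so it has already been seen.
        if (!generated[i]) result.push_back(i);
        unsigned next = i + digit_sum(i);
        if (next <= limit) generated[next] = 1;  // skip marks past the end
    }
    return result;
}
```

Calling `self_numbers_up_to(100)` should yield the familiar sequence 1, 3, 5, 7, 9, 20, 31, and so on up to 97.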

15 Answers

13

Generate the numbers once, copy the output into your code as one giant string, and print the string.

- Jimmy

1

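This is also how to cheat Project Euler's "2-minute rule": write a program that takes a week to run, but whose output is a program that prints the same result in a fraction of a second. After all, Project Euler doesn't say your toolchain has to finish within 2 minutes, only that your final program does. - Steve Jessop

Maybe it has changed, but back when I was active on Project Euler it was a one-minute rule, and only a guideline: if your program took longer than that, you had probably chosen the wrong approach. - Ponkadoodle

Sorry, yes, one minute. It's been a while since I solved any problems. Also, in Project Euler nobody cares or asks to see your code, which may not be the case for this question. My point is that in both cases it's obviously cheating by being clever. That doesn't mean it's not a good approach, if you think you can get away with it :-) - Steve Jessop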

13

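Those mods (%) look expensive. If you are allowed to switch to base 16 (or even base 2), you can probably write faster code. If you have to stay in decimal, try keeping an array of digits, one per place (units, tens, hundreds), and writing some rollover code. That makes summing the digits much easier.

Alternatively, you could recognize the behaviour of the core self function (let's call it s):

s = n + f(b,n)

where f(b,n) is the sum of the digits of the number n in base b.

For base 10, it's clear that as the ones digit moves through 0,1,2,...,9, n and f(b,n) advance in lockstep as you step from n to n+1. This only fails in the 10% of cases where a 9 rolls over to 0, so:

f(b,n+1) = f(b,n) + 1  // 90% of the time

So the core self function s advances as

n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2

s(n+1) = s(n) + 2 // again, 90% of the time

In the remaining (and easily identified) 10% of cases, the 9 rolls back to 0 and 1 is added to the next digit. In the simplest case this takes (9-1) off the digit sum f, but the carry can cascade through a run of k nines, taking off 9k-1 in total (17 for 99, 26 for 999, and so on).

So the first optimization is to remove most of the work from 90% of your loop iterations!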

if ((n % 10) != 0) 
{
  n + f(b,n) = n-1 + f(b,n-1) + 2;
}

or

if ((n % 10) != 0)
{
  s = old_s + 2;
}

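This should be enough to boost performance substantially without really changing your algorithm.

If you want more, work out a simple rule for how s changes across the remaining 10% of iterations.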
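A minimal sketch of the incremental update, covering the carry cases as well. It assumes base 10 and relies on the fact that stepping from n-1 to n changes s by +2 minus 9 for every trailing zero in n (each trailing zero is a 9 that rolled over). The function name `generated_values` is hypothetical and does not come from the answer.

```cpp
#include <vector>

// Hypothetical helper: returns s(n) = n + digit_sum(n) for n = 0..limit,
// computed incrementally instead of re-summing the digits each time.
std::vector<unsigned> generated_values(unsigned limit) {
    std::vector<unsigned> out;
    out.reserve(limit + 1);
    unsigned s = 0;  // s(0) = 0 + 0
    out.push_back(s);
    for (unsigned n = 1; n <= limit; ++n) {
        s += 2;  // n grows by 1 and, if nothing carries, so does the digit sum
        // Each trailing zero of n is a 9 that rolled over to 0 in this step,
        // which removes 9 from the digit sum for that position.
        for (unsigned m = n; m % 10 == 0; m /= 10) s -= 9;
        out.push_back(s);
    }
    return out;
}
```

The inner loop runs for only one iteration in ten, and rarely for more than one pass, so the common path is a single addition and one remainder test.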

- Alex Brown

5

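If you want your output to be fast, look into replacing the iostream output with plain old printf(). Whether this matters depends on the rules for winning the competition.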

- anon

3

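Since the range is limited (1 to 1000000), the maximum digit sum does not exceed 9*6 = 54. This means that a circular buffer of 54 elements should be perfectly sufficient to implement the sieve (and the sieve size grows very slowly as the range increases).

You already have a sieve-based solution, but it relies on pre-building the full-length buffer (a sieve of 1000000 elements), which is rather inelegant (if not outright unacceptable). The performance of your solution also suffers from the non-locality of its memory accesses.

For example, here is one possible, very simple implementation: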

#include <stdio.h>

#define N 1000000U

void print_self_numbers(void)
{
  #define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */

  unsigned char marks[NMARKS] = { 0 };
  unsigned i, imark;

  for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
  {
    unsigned digits, sum;

    if (!marks[imark])
      printf("%u ", i);
    else
      marks[imark] = 0;

    sum = i;
    for (digits = i; digits > 0; digits /= 10)
      sum += digits % 10;

    marks[sum % NMARKS] = 1;
  }
}

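I'm not aiming for the best CPU clock performance here, just illustrating the key idea with the circular buffer.

Of course, the range can easily be made a parameter of the function, and the size of the circular buffer can be derived from the range at run time.

As for "optimizations"... There is no point in trying to optimize code that contains I/O operations; such optimizations won't make any measurable difference. If you want to analyze the performance of the algorithm itself, put the generated numbers into an output array and print them later.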
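The two suggestions above (taking the range as a parameter and collecting results instead of printing them) could be sketched as follows. This is an illustration under stated assumptions rather than the answer's own code: the name `self_numbers_ring` is hypothetical, and the buffer is sized at one slot more than the largest possible digit sum.

```cpp
#include <vector>

// Hypothetical variant of the circular-buffer sieve: the range is a parameter,
// the buffer size is derived from it, and results are collected, not printed.
std::vector<unsigned> self_numbers_ring(unsigned limit) {
    // Upper bound on the digit sum of any number up to limit: 9 per digit.
    unsigned max_digit_sum = 0;
    for (unsigned t = limit; t > 0; t /= 10) max_digit_sum += 9;

    // A mark lands at most max_digit_sum positions ahead of i, so this many
    // slots keep the current slot and every pending mark distinct.
    const unsigned slots = max_digit_sum + 1;
    std::vector<unsigned char> marks(slots, 0);
    std::vector<unsigned> result;

    for (unsigned i = 1; i <= limit; ++i) {
        unsigned char& mark = marks[i % slots];
        if (!mark) result.push_back(i);  // nobody generated i
        mark = 0;                        // free the slot for reuse

        unsigned sum = i;
        for (unsigned d = i; d > 0; d /= 10) sum += d % 10;
        marks[sum % slots] = 1;          // i generates i + digit_sum(i)
    }
    return result;
}
```

With the output held in a vector, the sieve can be timed separately from whatever printing follows.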

- AnT stands with Russia

3

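Multithreading (use a different array/range for each thread). Also, don't use more threads than your CPU has cores =)

- Viktor Sehr

Part of each range depends on flags set by the previous range. You'd need some clever overlapping. - Jimmy

In the OP's algorithm, yes, and I must admit I never understood how it works. I managed to get rid of that range-propagation requirement, so my solution really is amenable to multithreading. - Carl Smotricz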

3

Using cout or printf inside the loop will slow it down. If you can move the printing out of the loop, you should see a significant performance gain.
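One way to take the printing out of the loop is to format everything into a single buffer and write it with one call. The sketch below assumes the self numbers have already been collected into a vector; `format_lines` and `write_all` are hypothetical names, and whether this helps in practice depends on the platform and on where the output goes.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: formats one number per line into a single buffer.
std::string format_lines(const std::vector<unsigned>& values) {
    std::string out;
    out.reserve(values.size() * 8);  // rough guess: up to 7 digits + newline
    for (unsigned v : values) {
        out += std::to_string(v);
        out.push_back('\n');
    }
    return out;
}

// Hypothetical helper: writes the whole buffer with one call
// instead of one call per number.
void write_all(const std::string& text) {
    std::fwrite(text.data(), 1, text.size(), stdout);
}
```

A caller would run the sieve first, then call `write_all(format_lines(result))` once at the end.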

- Brian T Hannan

1

This may help speed up C++ iostreams output:

cin.tie(0);
ios::sync_with_stdio(false);

Put them in main before you start writing to cout.

- Zan Lynx

1

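I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code that identifies self numbers is itself very fast: on my machine it executes in about 45 microseconds, roughly 150 times faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.

There is a bottleneck, however, and it appears to be the PCIe interface. My code took a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This can probably be optimized, and I will look into it.

Still, 45 microseconds is pretty quick. Scary quick, in fact, so I added code to my program to run Carl Smotricz's algorithm and compare the results for accuracy. The results are accurate. The program output shown below comes from a build compiled with VS2008 64-bit on Windows 7.

UPDATE

I recompiled this code in release mode with full optimization and the static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, cutting its runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from the video card to RAM was unaffected.

Program output: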

Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
  [kernel] : 35 us (0.09%)
  [memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0

Here is the code:

File: main.h

#pragma once

#include <cmath>    // floor, fmod
#include <cstdio>   // sprintf
#include <cstdlib>
#include <string>
#include <utility>  // std::pair, std::make_pair

typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
    return std::make_pair(ptr, size);
}

__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);

inline std::string format_elapsed(double d) 
{
    char buf[256] = {0};

    if( d < 0.00000001 )
    {
        // show in ps with 4 digits
        sprintf(buf, "%0.4f ps", d * 1000000000000.0);
    }
    else if( d < 0.00001 )
    {
        // show in ns
        sprintf(buf, "%0.0f ns", d * 1000000000.0);
    }
    else if( d < 0.001 )
    {
        // show in us
        sprintf(buf, "%0.0f us", d * 1000000.0);
    }
    else if( d < 0.1 )
    {
        // show in ms
        sprintf(buf, "%0.0f ms", d * 1000.0);
    }
    else if( d <= 60.0 )
    {
        // show in seconds
        sprintf(buf, "%0.2f s", d);
    }
    else if( d < 3600.0 )
    {
        // show in min:sec
        sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
    }
    // show in h:min:sec
    else 
        sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));

    return buf;
}

inline std::string format_pct(double d)
{
    char buf[256] = {0};
    sprintf(buf, "%.2f", 100.0 * d);
    return buf;
}

File: main.cpp

#define _CRT_SECURE_NO_WARNINGS 

#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>

#include "main.h"

int main()
{
    unsigned numVals = 1000000;
    int* gold = new int[numVals];
    memset(gold, 0, sizeof(int)*numVals);

    LARGE_INTEGER li = {0}, li2 = {0};
    QueryPerformanceFrequency(&li);
    __int64 freq = li.QuadPart;

    // get cuda properties...
    cudaDeviceProp cdp = {0};
    cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
    cout << "Running on device: '" << cdp.name << "'" << endl;

    // first run the reference implementation
    QueryPerformanceCounter(&li);
    for( int j6=0, n = 0; j6<10; j6++ ) 
    {
        for( int j5=0; j5<10; j5++ ) 
        {
            for( int j4=0; j4<10; j4++ ) 
            {
                for( int j3=0; j3<10; j3++ ) 
                {
                    for( int j2=0; j2<10; j2++ ) 
                    {
                        for( int j1=0; j1<10; j1++ )  
                        {
                            int s = j6 + j5 + j4 + j3 + j2 + j1;
                            gold[n + s] = 1;
                            n++;
                        }
                    }
                }
            }
        }
    }
    QueryPerformanceCounter(&li2);
    __int64 ticks = li2.QuadPart-li.QuadPart;
    cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;

    // now run the cuda version...
    unsigned threads = cdp.maxThreadsPerBlock;
    unsigned blocks = numVals/threads;
    if( numVals%threads ) ++blocks;
    unsigned computeSlots = blocks * threads;   // this may be != the number of vals since we want 32-thread warps

    // allocate device memory for test
    int* deviceTest = 0;
    err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
    err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);

    int* hostTest = new int[numVals];   // the repository for the resulting data on the host
    memset(hostTest, 0, sizeof(int)*numVals);

    // run the CUDA code...
    LARGE_INTEGER li3 = {0}, li4={0};
    QueryPerformanceCounter(&li3);
    ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
    QueryPerformanceCounter(&li4);

    __int64 ticksCuda = li4.QuadPart-li3.QuadPart;
    cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
    cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;


    unsigned errorCount = 0;
    for( size_t i = 0; i < numVals; ++i )
    {
        if( gold[i] != hostTest[i] )
        {
            ++errorCount;
        }
    }

    cout << "Number of Errors: " << errorCount << endl;

    return 0;
}

File: self.cu

#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"

__global__ void SelfNum(int * slots)
{
    __shared__ int N;
    N = (blockIdx.x * blockDim.x) + threadIdx.x;

    const int numDigits = 10;

    __shared__ int digits[numDigits];
    for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
    {
        digits[numDigits-i-1] = temp - 10 * (temp/10)      /*temp % 10*/;
    }

    __shared__ int s;
    s = 0;
    for( int i = 0; i < numDigits; ++i )
        s += digits[i];

    slots[N+s] = 1;
}

__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned  blocks, const unsigned threads)
{
    LARGE_INTEGER li = {0};
    QueryPerformanceFrequency(&li);
    double freq = (double)li.QuadPart;

    LARGE_INTEGER liStart = {0};
    QueryPerformanceCounter(&liStart);

    // run the kernel
    SelfNum<<<blocks, threads>>>(deviceMem.first);
    LARGE_INTEGER liKernel = {0};
    QueryPerformanceCounter(&liKernel);

    cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
    LARGE_INTEGER liMemcpy = {0};
    QueryPerformanceCounter(&liMemcpy);

    // display performance stats
    double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
        eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
        eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;

    double pKernel = eKernel/e,
        pMemcpy = eMemcpy/e;

    cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
        << "  [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
        << "  [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;



}

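Update 2:

I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling the loops manually, fixing some questionable uses of __shared__ memory that may have been errors, and getting rid of some redundancy.

The output of my new kernel is: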

Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
  [kernel] : 39 us (1.57%)
  [memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0

The only code I changed was the kernel itself, so that is all I will post here.

__global__ void SelfNum(int * slots)
{
    int N = (blockIdx.x * blockDim.x) + threadIdx.x;

    int s = 0;

    int temp = N;
    s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;
    temp /= 10; s += temp - 10 * (temp/10)      /*temp % 10*/;

    slots[N+s] = 1;
}

- John Dibling

1

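For such a simple task, your best option is to think about alternative algorithms that produce the same result. %10 is not usually considered a fast operation.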

- kauppi

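I don't know what you have been doing, but the modulo operator is absolutely basic and has to be fast. Here, on an 1800XP processor, 4M modulo operations plus 4M assignments complete in 50 ms, versus 30 ms for the 4M assignments alone. - Pasi Savolainen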

- Carl Smotricz · Accepted Answer

I built an alternate C solution that doesn't require any modulo or division operations:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
   int v[1100000];
   int j1, j2, j3, j4, j5, j6, s, n=0;
   memset(v, 0, sizeof(v));
   for (j6=0; j6<10; j6++) {
      for (j5=0; j5<10; j5++) {
         for (j4=0; j4<10; j4++) {
            for (j3=0; j3<10; j3++) {
               for (j2=0; j2<10; j2++) {
                  for (j1=0; j1<10; j1++) {
                     s = j6 + j5 + j4 + j3 + j2 + j1;
                     v[n + s] = 1;
                     n++;
                  }
               }
            }
         }
      }
   }
   for (n=1; n<=1000000; n++) {
      if (!v[n]) printf("%6d\n", n);
   }
}

This generates 97786 self numbers, including 1 and 1000000.
With output, it takes:

real        0m1.419s
user        0m0.060s
sys         0m0.152s

When I redirect the output to /dev/null, it takes

real     0m0.030s
user     0m0.024s
sys      0m0.004s

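on my 3 GHz quad-core machine.

For comparison, your version produces the same count of numbers, so I assume we are either both correct or both wrong; but your version takes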

real    0m0.064s
user    0m0.060s
sys     0m0.000s

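under the same conditions, or about twice as long.

That could be because you're using long, which isn't needed on my machine. Here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?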
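The INT_MAX check can also be left to the compiler. A small sketch, assuming the six-digit sieve above, where the largest index written is 999999 + 9*6; the names `kLargestIndex` and `int_is_wide_enough` are hypothetical.

```cpp
#include <limits>

// Largest index the six-digit sieve can write to: 999999 plus a digit sum of 9*6.
constexpr long kLargestIndex = 999999L + 9L * 6L;

// Hypothetical check: true when int can represent every index the sieve uses.
constexpr bool int_is_wide_enough() {
    return std::numeric_limits<int>::max() >= kLargestIndex;
}

// Fails to compile on a platform where int is too narrow (e.g. 16-bit int).
static_assert(int_is_wide_enough(), "int too narrow; use long for the index");
```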

Update

I had a hunch that it might be better to calculate the sum by chunks. Here's my new code:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
   char v[1100000];
   int j1, j2, j3, j4, j5, j6, s, n=0;
   int s1, s2, s3, s4, s5;
   memset(v, 0, sizeof(v));
   for (j6=0; j6<10; j6++) {
      for (j5=0; j5<10; j5++) {
         s5 = j6 + j5;
         for (j4=0; j4<10; j4++) {
            s4 = s5 + j4;
            for (j3=0; j3<10; j3++) {
               s3 = s4 + j3;
               for (j2=0; j2<10; j2++) {
                  s2 = s3 + j2;
                  for (j1=0; j1<10; j1++) {
                     v[s2 + j1 + n++] = 1;
                  }
               }
            }
         }
      }
   }
   for (n=1; n<=1000000; n++) {
      if (!v[n]) printf("%d\n", n);
   }
}

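And what do you know, that brought the time for the top loop down from 12 ms to 4 ms. Or maybe 8; my clock seems to be getting a bit jittery down there.

Status and summary

Actually finding the self numbers up to 1M now takes roughly 4 ms, and I'm having trouble measuring any further improvement. On the other hand, as long as output goes to the console, it will continue to take about 1.4 seconds, despite my best efforts to make use of buffering. The I/O time dwarfs the computation so drastically that any further optimization would be essentially futile. So, although inspired by further comments, I've decided to leave well enough alone.

All times cited are from my (pretty fast) machine and are for comparison with each other only. Your mileage may vary.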