在C++中有非常快的近似对数（自然对数）函数吗？

Question

在C++中有非常快的近似对数（自然对数）函数吗？

c++mathlogarithmmicro-optimizationsqrt

20

我们可以找到各种技巧来替换std::sqrt（计时平方根），以及一些用于std::exp的技巧（使用更快的指数近似），但我没有找到任何可以替换std::log的技巧。

它是我的程序中循环的一部分，并且被多次调用，而且虽然exp和sqrt已经被优化了，但Intel VTune现在建议我优化std::log，之后似乎只有我的设计选择会受到限制。

目前，我使用x在-0.5和+0.5之间的ln(1+x)的三阶泰勒逼近（最大误差为4%的情况下占90%），否则回退到std::log。这使我加速了15%。

- qwark

3

在现代CPU上，std::sqrt编译为单个指令。很难相信你可以用类似的精度做得比这更快。 - plasmacel

1

如果float的精度足够，为什么不从cmath中调用logf()呢？或者问题是您需要double的完整输入域，但计算结果仅相当于float的精度（约6个小数位）？ - njuffa

1

@user3091460，那个网站上的误差计算不正确。sqrtss可以精确到完整精度，而rsqrtss * x加上一次牛顿-拉弗森迭代仍然无法达到完整精度。 - plasmacel

2

你认为你的实现中的 std::log 函数没有使用系统可用的最有效算法吗？如果你愿意为了速度而牺牲精度（我可能会说一些关于快速得到错误答案的事情），你需要在问题中说明。 - Keith Thompson

1

目前我使用ln(1+x)的三阶泰勒逼近，其中x在-0.5和+0.5之间（最大误差为4%的情况下占90%），否则退回到std::log。这给了我15%的加速。 - qwark

显示剩余9条评论

8个回答

4

请看这个讨论，其中所接受的答案涉及一个基于Zeckendorf分解的对数计算函数的实现。

在实现文件的评论中，有关复杂性和一些技巧以达到O(1)的讨论。

希望这可以帮到你！

- Mo Abdul-Hameed

我会看一下，谢谢。 - qwark

2

#include <math.h>
#include <iostream>

constexpr int LogPrecisionLevel = 14;
constexpr int LogTableSize = 1 << LogPrecisionLevel;

double log_table[LogTableSize];

void init_log_table() {
    for (int i = 0; i < LogTableSize; i++) {
        log_table[i] = log2(1 + (double)i / LogTableSize);
    }
}

double fast_log2(double x) { // x>0
    long long t = *(long long*)&x;
    int exp = (t >> 52) - 0x3ff;
    int mantissa = (t >> (52 - LogPrecisionLevel)) & (LogTableSize - 1);
    return exp + log_table[mantissa];
}

int main() {
    init_log_table();

    double d1 = log2(100); //6.6438561897747244
    double d2 = fast_log2(100); //6.6438561897747244
    double d3 = log2(0.01); //-6.6438561897747244
    double d4 = fast_log2(0.01); //-6.6438919626096089
}

- Saar Ibuki

你的类型转换违反了严格别名规则。（使用memcpy代替指针转换。此外，你可能应该使用unsigned long long，因为你不需要算术移位。在补码机器上正确性无关紧要，但仍然需要注意。）这还要求整数字节序与浮点数字节序匹配，例如在x86上，所以你至少应该记录一下。 - Peter Cordes

解释表查找策略、在某些输入范围内的实际相对/绝对精度以及限制，例如0或负输入的情况，是一个好主意。 - Peter Cordes

你的表格只需要是float类型。这将减少缓存占用一半的空间。（但是一个2^14 * 4字节的表格仍然是64kiB。在大多数情况下，你会遇到很多缓存未命中，这就是为什么大多数快速对数实现在现代CPU上使用多项式逼近而不是表查找的原因。特别是当你可以使用SIMD时：AVX2中log2(__m256d)的高效实现） - Peter Cordes

1

Peter，抱歉评论了一个非常古老的答案，但是这里真的违反了严格别名规则吗？我猜你指的是fast_log2函数中的第一行。我认为这里确实没有别名，x的值被复制，重新解释为long long（所以与memcpy非常相似）。除非我漏掉了什么，否则t不是别名，对吧？ - Asher

@Asher - 由于您没有@通知我，所以我没有看到您的回复。*(long long*)&x;解引用了一个指向double对象的long long*。这是严格别名UB。另请参见*实践中的联合、别名和类型转换：什么有效，什么无效？*。一些编译器在简单情况下可能会将其编译为与memcpy(&t, &x, sizeof(t));或auto t = std::bit_cast<uint64_t>(x);相同的方式，但除非使用gcc -fno-strict-aliasing禁用基于类型的别名分析优化，否则不能保证。 - Peter Cordes

2

我提议在Visual Studio（x86）中编写ln(x)代码的以下版本。这里b0，...，b3是多项式的调整系数，该多项式是通过Chebyshev逼近函数f(t) = lb((5+t)/(5-t))/t得到的。

首先计算t/5 = (x-1)/(x+1)，其中x在[0.75; 1.5]范围内。在获得近似值f(t)之后，通过以下公式计算结果： ln(x) = (t*f(t)+k)*ln(2)，

其中k是将x的指数减小到[0.75; 1.5]范围所需的数字。最大相对误差为2.53708704e-07，它是3.49952803 ulps。均方根相对误差为5.06926602e-08。

_declspec(naked) float _vectorcall ln(float x)
{
  static const float ct[6] =       // Constant table
  {
    1.0f,                          // 1
    0.576110899f,                  // b2*5^5
    0.961808264f,                  // b1*5^3
    2.88539004f,                   // b0*5
    0.442831367f,                  // b3*5^7
    0.693147181f                   // ln(2)
  };
  _asm
  {
    vmovd eax,xmm0                 // Read the binary representation of the x into eax 
    mov edx,-127                   // In edx: neg offset of the exponent of normalized numbers
    cmp eax,0x7F800000             // Compare x with the Inf value
    jnc ln_bad_x                   // Break calculations if x<=-0, x=+Inf or x=+NaN
    ror eax,23                     // Shift eax so that its exponent is in al
    movzx ecx,al                   // Get the exponent with an offset of +127 in ecx
    jecxz ln_denorm                // Jump if x=0 or x is denormalized
      ln_calc:                     // The entry point after normalization of the number
    setnc al                       // al=1 if the mantissa is less than 1.5, otherwise al=0
    adc ecx,edx                    // ecx=k - the integer part of the binary logarithm
    or eax,126                     // Change the exponent of x so that it becomes 0.75<=x<1.5
    ror eax,9                      // In eax: the value of x, in cf: its sign bit
    vmovd xmm0,eax                 // Write the reduced value of x into xmm0
    mov eax,offset ct              // In eax the address of the constant table
    vaddss xmm1,xmm0,[eax]         // xmm1 = x+1
    vsubss xmm0,xmm0,[eax]         // xmm0 = x-1
    vcvtsi2ss xmm3,xmm3,ecx        // xmm3 = k, the integer addition to the binary logarithm
    vdivss xmm0,xmm0,xmm1          // xmm0 = (x-1)/(x+1)=t/5
    vmovss xmm1,[eax+16]           // xmm1 = 5^7*b3 - initialize the sum
    vmulss xmm2,xmm0,xmm0          // xmm2 = t^2/25 - prepare the argument of the polynomial
    vfmadd213ss xmm1,xmm2,[eax+4]  // Step 1 calculating the polynomial by the Нorner scheme
    vfmadd213ss xmm1,xmm2,[eax+8]  // Step 2 calculating the polynomial by the Нorner scheme
    vfmadd213ss xmm1,xmm2,[eax+12] // Step 3 calculating the polynomial by the Нorner scheme
    vfmadd213ss xmm0,xmm1,xmm3     // xmm0 = t*f(t)+k - ready binary logarithm
    vmulss xmm0,xmm0,[eax+20]      // Convert binary logarithm to natural
    ret                            // Return
      ln_denorm:                   // Processing denormalized values of x including x=0
    bsr ecx,eax                    // Search for the highest set bit; zf=1 if x=+0
    mov dl,98                      // 31 is added to the exponent, so we take edx=-158
    ror eax,cl                     // Form a mantissa of a normalized number x*2^31
    jnz ln_calc                    // Go to calculate ln(x) if x>0
    mov dl,128                     // Form the highest word of the value -Inf in dx
    vpinsrw xmm0,xmm0,edx,1        // Replace the highest word of lowest float in xmm0
    ret                            // Return the result –Inf for x=+0
      ln_bad_x:                    // The entry point for x<=-0, x=+Inf or x=+NaN
    jnl ln_exit                    // Return x for x=+Inf or x=+NaN
    vsqrtss xmm0,xmm0,xmm0         // The root of a negative number gives the result NaN
    vrcpss xmm0,xmm0,xmm0          // In the case of x=-0 generate the result -Inf
      ln_exit:                     // The result in xmm0 is ready
    ret                            // Return
  }
}

- aenigma

Peter Cordes，感谢您的有趣评论。实际上，我并不是一个很擅长优化方面的专家。您可能是对的，sbb al,al比salc稍微好一点。 - aenigma

2

@qwark：为获得客观评估，建议在同一台机器上（如果可能，在相同的环境下）对您感兴趣的算法进行比较测试。如果算法提供显着不同的准确性，则还可以使用诸如运行时间与RMS相对误差比之类的指标进行比较。 - aenigma

1

在我看来，一个很好的测试是遍历从0到无穷大的所有浮点数，并将结果与通过库对数函数获得的双精度值进行比较。 - aenigma

1

正确的说法是，不是比率，而是工作时间和误差的乘积。越少越好。 - aenigma

1

@Peter Cordes: 的确，为了减少在英特尔和 AMD 上性能的差异，这三条指令（salc: adc ecx,edx: add al,127）可以替换为以下指令：setnc al: adc ecx,edx: or al,126。 - aenigma

显示剩余5条评论

2

精度和性能略有提高：

#include <bit>   // C++20

//fast_log abs(rel) : avgError = 2.85911e-06(3.32628e-08), MSE = 4.67298e-06(5.31012e-08), maxError = 1.52588e-05(1.7611e-07)
const float s_log_C0 = -19.645704f;
const float s_log_C1 = 0.767002f;
const float s_log_C2 = 0.3717479f;
const float s_log_C3 = 5.2653985f;
const float s_log_C4 = -(1.0f + s_log_C0) * (1.0f + s_log_C1) / ((1.0f + s_log_C2) * (1.0f + s_log_C3)); //ensures that log(1) == 0
const float s_log_2 = 0.6931472f;

// assumes x > 0 and that it's not a subnormal.
// Results for 0 or negative x won't be -Infinity or NaN
inline float fast_log(float x)
{
    unsigned int ux = std::bit_cast<unsigned int>(x);
    int e = static_cast<int>(ux - 0x3f800000) >> 23; //e = exponent part can be negative
    ux |= 0x3f800000;
    ux &= 0x3fffffff; // 1 <= x < 2  after replacing the exponent field
    x = std::bit_cast<float>(ux);
    float a = (x + s_log_C0) * (x + s_log_C1);
    float b = (x + s_log_C2) * (x + s_log_C3);
    float c = (float(e) + s_log_C4);
    float d = a / b;
    return (c + d) * s_log_2;
}

虽然除法被认为是一种缓慢的运算，但处理器可以并行执行许多操作。

- jenkas

1

我还需要一个快速的日志近似算法，到目前为止最好的算法似乎是基于Ankerls算法。

解释: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/

代码 (来自 https://github.com/ekmett/approximate/blob/dc1ee7cef58a6b31661edde6ef4a532d6fc41b18/cbits/fast.c#L127):

double log_fast_ankerl(double a)
{
  static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "Little endian is required!");
  union { double d; int x[2]; } u = { a };
  return (u.x[1] - 1072632447) * 6.610368362777016e-7;
}

只需一次减法和乘法。这个方法非常好且速度非常快，无人能敌。

- Kevin Meier

1

你忘记了提取double的高半部分作为int（例如x86 psrlq xmm0，32，如果您的编译器足够聪明，不会将值反弹到整数寄存器和返回），以及int到double的转换（例如x86 cvtdq2pd）。它仍然非常快，特别是如果您的编译器做得很好，但是应该检查生成的汇编代码，看看编译器是否使用了SIMD-整数减法而不是movq rax，xmm0 / shr rax，32 / sub eax，1072632447 / cvtsi2sd xmm0，eax。不幸的是，gcc和clang确实会反弹到标量整数并返回：https://godbolt.org/z/osYoj5bz1 - Peter Cordes

1

我对@njuffa的答案进行了向量化。自然对数，使用AVX2：

inline __m256 mm256_fmaf(__m256 a, __m256 b, __m256 c){
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

//https://dev59.com/fFkS5IYBdhLWcg3wdmnC#39822314
//https://dev59.com/fFkS5IYBdhLWcg3wdmnC#65537754
// vectorized version of the answer by njuffa
/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
inline __m256 fast_log_sse(__m256 a){

    __m256i aInt = *(__m256i*)(&a);
    __m256i e =    _mm256_sub_epi32( aInt,  _mm256_set1_epi32(0x3f2aaaab));
            e =    _mm256_and_si256( e,  _mm256_set1_epi32(0xff800000) );
        
    __m256i subtr =  _mm256_sub_epi32(aInt, e);
    __m256 m =  *(__m256*)&subtr;

    __m256 i =  _mm256_mul_ps( _mm256_cvtepi32_ps(e), _mm256_set1_ps(1.19209290e-7f));// 0x1.0p-23
    /* m in [2/3, 4/3] */
    __m256 f =  _mm256_sub_ps( m,  _mm256_set1_ps(1.0f) );
    __m256 s =  _mm256_mul_ps(f, f); 
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    __m256 r =  mm256_fmaf( _mm256_set1_ps(0.230836749f),  f,  _mm256_set1_ps(-0.279208571f) );// 0x1.d8c0f0p-3, -0x1.1de8dap-2
    __m256 t =  mm256_fmaf( _mm256_set1_ps(0.331826031f),  f,  _mm256_set1_ps(-0.498910338f) );// 0x1.53ca34p-2, -0x1.fee25ap-2

           r =  mm256_fmaf(r, s, t);
           r =  mm256_fmaf(r, s, f);
           r =  mm256_fmaf(i, _mm256_set1_ps(0.693147182f),  r);  // 0x1.62e430p-1 // log(2)
    return r;
}

- Kari

请注意，您的mm256_fmaf可以编译为单独的乘法和加法操作，并对中间产品进行舍入。它不能保证是FMA。（只有像GCC这样的一些编译器会在目标支持FMA时将其“合并”为FMA指令，就像大多数AVX2机器所做的那样（不是全部：一个VIA设计）。最好只针对AVX2+FMA3并使用_mm256_fmadd_ps，如果需要可以使用可选的回退，但默认情况下不要使用具有误导性名称且可能更慢的fma函数。 - Peter Cordes

-1

这取决于您需要的准确度。通常会调用日志以了解数字的数量级，您可以通过检查浮点数的指数字段来免费完成此操作。那也是您的第一个近似值。我要向我的书《基本算法》推销，该书解释了如何从第一原则实现标准库数学函数。

- Malcolm McLean

我正在寻找实际数学应用中的自然对数，不需要双精度、浮点精度甚至10-3、10-4就可以。 - qwark

2

一个链接或书籍参考而不引用相关部分的翻译不是答案。 - BeyelerStudios

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- njuffa · Accepted Answer

在设计和部署用于性能的超越函数的定制实现之前，强烈建议在算法级别以及通过工具链进行优化。不幸的是，我们没有有关要优化的代码的任何信息，也没有有关工具链的信息。

在算法级别上，检查是否真正需要所有对超越函数的调用。也许存在一种数学变换，需要更少的函数调用，或将超越函数转换为代数运算。是否有可能多余的超越函数调用，例如因计算不必要地在对数空间中切换而导致冗余？如果精度要求适度，是否可以使用float而不是double在整个计算过程中执行整个计算？在大多数硬件平台上，避免double计算可以显著提高性能。

编译器倾向于提供影响数字密集型代码性能的各种开关。除了将一般优化级别增加到-O3之外，通常还有一种方法可以关闭denormal支持，即打开flush-to-zero或FTZ模式。这在各种硬件平台上都具有性能优势。此外，通常有一个“快速数学”标志，其使用会导致略微降低精度并消除处理特殊情况（如NaN和无穷大）以及处理errno的开销。一些编译器还支持代码自动矢量化并带有SIMD数学库，例如Intel编译器。

对数函数的自定义实现通常涉及将二进制浮点参数x分离为指数e和尾数m，使得x = m * 2^e，因此log(x) = log(2) * e + log(m)。m的选择使其接近于单位，因为这提供了高效的近似值，例如通过极小值多项式逼近 log(m) = log(1+f) = log1p(f)。

C ++提供了frexp()函数，将浮点运算数分解为尾数和指数，但在实践中，人们通常使用更快的特定于机器的方法，通过重新解释它们作为相同大小的整数来操作浮点数据。下面的单精度对数函数logf()演示了两种变体。函数__int_as_float()和__float_as_int()提供了将int32_t重新解释为IEEE-754 binary32浮点数及其反向操作的方法。此代码在大多数当前处理器（CPU或GPU）上直接支持的融合乘加操作FMA上严重依赖。在fmaf()映射到软件模拟的平台上，此代码将无法接受地缓慢。

#include <cmath>
#include <cstdint>
#include <cstring>

float __int_as_float (int32_t a) { float r; memcpy (&r, &a, sizeof r); return r;}
int32_t __float_as_int (float a) { int32_t r; memcpy (&r, &a, sizeof r); return r;}

/* compute natural logarithm, maximum error 0.85089 ulps */
float my_logf (float a)
{
    float i, m, r, s, t;
    int e;

#if PORTABLE
    m = frexpf (a, &e);
    if (m < 0.666666667f) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - __float_as_int (0.666666667f)) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
#endif // PORTABLE
    /* m in [2/3, 4/3] */
    m = m - 1.0f;
    s = m * m;
    /* Compute log1p(m) for m in [-1/3, 1/3] */
    r =             -0.130310059f;  // -0x1.0ae000p-3
    t =              0.140869141f;  //  0x1.208000p-3
    r = fmaf (r, s, -0.121483512f); // -0x1.f198b2p-4
    t = fmaf (t, s,  0.139814854f); //  0x1.1e5740p-3
    r = fmaf (r, s, -0.166846126f); // -0x1.55b36cp-3
    t = fmaf (t, s,  0.200120345f); //  0x1.99d8b2p-3
    r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
    r = fmaf (t, m, r);
    r = fmaf (r, m,  0.333331972f); //  0x1.5554fap-2
    r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1  
    r = fmaf (r, s, m);
    r = fmaf (i,  0.693147182f, r); //  0x1.62e430p-1 // log(2)
    if (!((a > 0.0f) && (a < INFINITY))) {
        r = a + a;  // silence NaNs if necessary
        if (a  < 0.0f) r = INFINITY - INFINITY; //  NaN
        if (a == 0.0f) r = -INFINITY;
    }
    return r;
}

正如代码注释中所述，上述实现提供了忠实舍入的单精度结果，并处理与IEEE-754浮点标准一致的异常情况。通过消除特殊情况支持、消除对非规格化参数的支持和降低精度，可以进一步提高性能。这导致以下示例变体:

/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
float my_faster_logf (float a)
{
    float m, r, s, t, i, f;
    int32_t e;

    e = (__float_as_int (a) - 0x3f2aaaab) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = (float)e * 1.19209290e-7f; // 0x1.0p-23
    /* m in [2/3, 4/3] */
    f = m - 1.0f;
    s = f * f;
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    r = fmaf (0.230836749f, f, -0.279208571f); // 0x1.d8c0f0p-3, -0x1.1de8dap-2
    t = fmaf (0.331826031f, f, -0.498910338f); // 0x1.53ca34p-2, -0x1.fee25ap-2
    r = fmaf (r, s, t);
    r = fmaf (r, s, f);
    r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2) 
    return r;
}