Quickly getting the floating-point number that is the nearest power of 2

Question

c++ x86 floating-point ieee-754

6

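In numerical computation, it is often necessary to scale numbers into a safe range.

For example, consider computing the Euclidean distance: sqrt(a^2+b^2). Here, if the magnitude of a or b is too small or too large, underflow or overflow can occur.

A common way to solve this is to divide the numbers by the largest-magnitude number. However, this solution:

is slow (division is slow)
causes a slight loss of accuracy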

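So I thought that, instead of dividing by the largest-magnitude number, we could multiply by a power of 2 that is close to its reciprocal. This seems to be a better solution because:

multiplication is much faster than division
it is more accurate, since multiplying by a power of 2 is exact (as long as the result neither overflows nor becomes subnormal)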
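The exactness claim can be checked with a small round-trip experiment. The sketch below is only an illustration (the helper names `roundtrip_pow2` and `roundtrip_div` are made up for it): scaling by a power of 2 and back changes only the exponent, while scaling by a general divisor rounds twice.

```c
#include <math.h>

/* Scale x down by 2^k and back up again. As long as no intermediate result
   overflows or becomes subnormal, only the exponent field changes, so the
   round trip reproduces x exactly. */
double roundtrip_pow2(double x, int k) {
    double s = ldexp(1.0, k);      /* s = 2^k, exactly representable */
    double r = ldexp(1.0, -k);     /* r = 2^-k, the exact reciprocal  */
    return (x * r) * s;
}

/* Scale x by a general divisor m and back. Both operations may round,
   so the result is not guaranteed to equal x. */
double roundtrip_div(double x, double m) {
    return (x / m) * m;
}
```

For example, `roundtrip_pow2(0.1, 100)` gives back exactly 0.1, whereas `roundtrip_div(1.0, 49.0)` does not give back exactly 1.0 in IEEE-754 double arithmetic.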

So I would like to create a small utility function with logic like the following (^ denotes exponentiation):

void getScaler(double value, double &scaler, double &scalerReciprocal) {
    int e = <exponent of value>;
    if (e<-1022) { scaler=2^-1022; scalerReciprocal = 2^1022; }
    else if (e>1022) { scaler=2^1022; scalerReciprocal = 2^-1022; }
    else { scaler=2^e; scalerReciprocal = 2^(-e); }
}

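This function should return a normalized scaler and scalerReciprocal, both powers of 2, where scaler is close to value and scalerReciprocal is the reciprocal of scaler.

The allowed exponents of scaler/scalerReciprocal are -1022..1022 (I don't want to deal with subnormal scalers, as subnormal numbers can be slow).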
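As a baseline for comparison, the logic above can be written directly on the IEEE-754 bit pattern. This is a sketch under the assumption that double is IEEE-754 binary64; it uses pointers instead of C++ references so that it is plain C, and the names `get_scaler_ref` and `pow2_from_biased` are hypothetical. In terms of the biased exponent E = e + 1023, the range -1022..1022 becomes 1..2045, and the reciprocal has biased exponent 2046 - E.

```c
#include <stdint.h>
#include <string.h>

/* Build the double whose biased exponent field is E and whose fraction is 0,
   i.e. 2^(E-1023), valid for 1 <= E <= 2046. */
static double pow2_from_biased(uint64_t E) {
    uint64_t bits = E << 52;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reference version of the question's getScaler (hypothetical name). */
void get_scaler_ref(double value, double *scaler, double *scaler_reciprocal) {
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    uint64_t E = (bits >> 52) & 0x7FF;   /* biased exponent, 0..2047 */
    if (E < 1)    E = 1;                 /* zero/subnormal: clamp to 2^-1022 */
    if (E > 2045) E = 2045;              /* huge/inf/NaN:   clamp to 2^1022  */
    *scaler            = pow2_from_biased(E);         /* 2^e  */
    *scaler_reciprocal = pow2_from_biased(2046 - E);  /* 2^-e */
}
```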

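Is there a fast way to do this? Can it be done with pure floating-point operations? Or should I extract the exponent from value and use simple ifs for the logic? Is there some trick to compare against (-)1022 quickly (since the range is symmetric)?

Note: scaler does not need to be the closest possible power of 2. If some logic requires it, scaler may be a few small powers of 2 away from the closest value.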

- geza

Are you interested in portable pure C that compiles efficiently for x86? Or are you also interested in C with SIMD intrinsics, such as AVX512 _mm512_getexp_pd (extracts the exponent as a double) and _mm512_scalef_pd (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2403,6062,4147,4841,4841&techs=SSE2,SSE4_2,AVX,AVX2,AVX_512,Other&text=_mm512_scalef_pd), which does dst[63:0] := tmp_src1[63:0] * POW(2, FLOOR(tmp_src2[63:0])) (i.e. it adds the integer part of one double to the exponent field of another)? - Peter Cordes

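@PeterCordes: OK, thanks for the information. I have a clamp, and then I need to create a double from the clamped value. I expect there may be some bit tricks here. If the clamp doesn't modify the value, then I only need a simple AND (clearing the significand). If the clamp does modify it, more work is needed. Maybe there is a really clever way to compress these "if"s into something neat. - geza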

@PatriciaShanahan: There is a standard predefined macro that tells you whether it is or not. - Ben Voigt

@PatriciaShanahan: Yes. - geza

1

@PeterCordes Regarding the ternary operators and minsd, minpd, maxsd, maxpd: in some respects clang does better than gcc. Godbolt link - wim


3 Answers

3

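Based on wim's answer, here is another solution that can be faster because it uses one instruction fewer. The output differs slightly, but it still meets the requirements.

The idea is to fix the edge cases with bit operations: put 01 into the two least significant bits of the exponent, whatever its value. So, for the exponent:

0 becomes 1 (-1023 becomes -1022)
2046 becomes 2045 (1023 becomes 1022)
other exponents are modified too, but only slightly: compared with wim's solution, the number may become twice as large (when the exponent's low bits change from 00 to 01), half as large (10->01), or a quarter (11->01)

So this modified program works (I think it is pretty cool that the problem can be solved with just 2 fast asm instructions):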
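The mapping in the list above can be reproduced on the 11-bit biased exponent alone. This is a small illustrative sketch (the name `fix_exponent` is made up); the masks 0x7FD and 0x001 are the exponent parts of the 64-bit constants 0x7FD0000000000000 and 0x0010000000000000 used in the program.

```c
#include <stdint.h>

/* Force the two lowest bits of an 11-bit biased exponent to binary 01:
   the AND clears bit 1, the OR sets bit 0. */
uint32_t fix_exponent(uint32_t biased) {
    return (biased & 0x7FDu) | 0x001u;
}
```

For instance, 0 maps to 1, 2046 and 2047 both map to 2045, and the four neighbours 1024, 1025, 1026, 1027 all map to 1025.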

#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */

union dbl_int64{
    double d;
    uint64_t i;
};

double get_scale(double t){
    union dbl_int64 x;
    uint64_t and_i;
    uint64_t or_i;
         /* 0xFEDCBA9876543210 */
    and_i = 0x7FD0000000000000ull;
    or_i =  0x0010000000000000ull;
    x.d = t;
    x.i = (x.i & and_i)|or_i;                     /* Clear sign and fraction bits, force the two lowest exponent bits to 01 */
    return x.d;
}

double get_scale_x86(double t){
    __m128d x = _mm_set_sd(t);
    __m128d x_and = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
    __m128d x_or  = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
            x     = _mm_and_pd(x, x_and);
            x     = _mm_or_pd(x, x_or);
    return _mm_cvtsd_f64(x);
}

/* Compute the inverse 1/t of a double t with all zero fraction bits     */
/* and exponent between the limits of function get_scale                 */
/* A single integer subtraction is much less expensive than a            */
/* floating point division.                                               */
double inv_of_scale(double t){
    union dbl_int64 x;
                     /* 0xFEDCBA9876543210 */
    uint64_t inv_mask = 0x7FE0000000000000ull;
    x.d = t;
    x.i = inv_mask - x.i;
    return x.d;
}

double inv_of_scale_x86(double t){
    __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
    __m128d x        = _mm_set_sd(t);
    __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
    return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}


int main(){
    int n = 14;
    int i;
    /* Several example values, 4.94e-324 is the smallest subnormal */
    double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,  
                     1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};
    double z, s, u;

    printf("Portable code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale(z);
        u = inv_of_scale(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    printf("\nx86 specific SSE code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale_x86(z);
        u = inv_of_scale_x86(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    return 0;
}

- geza

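@PeterCordes: Yes, it is really just two instructions :) I was glad I could drop one instruction from wim's version, but I hadn't noticed that the first instruction isn't actually needed. Thanks for the link, I'll check it out! - geza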

1

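You might want to add a version of the function that takes two inputs and returns a scale factor for both, because, as I said, you can combine finding the maximum magnitude with zeroing the significand. - Peter Cordes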
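One possible reading of this suggestion, sketched in C: take the larger magnitude on the integer bit patterns and apply the same AND/OR masks to it. The function name `get_scale2` and the helpers are hypothetical, and this is only a guess at what a two-input variant could look like, assuming IEEE-754 binary64.

```c
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double d)     { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static double   double_of(uint64_t u) { double d;   memcpy(&d, &u, sizeof d); return d; }

/* Hypothetical two-input variant: one power-of-2 scaler for both a and b. */
double get_scale2(double a, double b) {
    const uint64_t abs_mask = 0x7FFFFFFFFFFFFFFFull;
    uint64_t ia = bits_of(a) & abs_mask;   /* bit pattern of |a| */
    uint64_t ib = bits_of(b) & abs_mask;   /* bit pattern of |b| */
    /* For non-negative doubles, integer order matches floating-point order,
       so the integer maximum selects the larger magnitude. */
    uint64_t m = ia > ib ? ia : ib;
    /* Same AND/OR as get_scale: zero the fraction bits and force the two
       lowest exponent bits to 01. */
    return double_of((m & 0x7FD0000000000000ull) | 0x0010000000000000ull);
}
```

For example, `get_scale2(0.7, -123.1)` yields 64.0, the same scaler that get_scale returns for 123.1 alone.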

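Nice solution. I think your get_scale function is a bit less stable than mine, but it probably works well for your application. - wim

@wim: Remember that the OP's use case is just scaling values to a magnitude where squaring them won't overflow to infinity or underflow to denormal/zero, for things like sqrt(a^2+b^2). Leaving the larger value in the range [1..2) rather than [0.5..1) is totally fine. Using OR to apply the minimum is a really good idea for this use case. - Peter Cordes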

1

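@PeterCordes: With the AND/OR approach, normal values are scaled into the range [0.5…8.0), which is perfect for reliably computing the length of the hypotenuse of a right triangle. - wim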


2

You can use

double frexp (double x, int* exp);

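The return value is the fractional part of x (with magnitude in [0.5, 1)), and exp receives the exponent (with the bias removed).

Alternatively, the following code gets the exponent field of a double.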
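A short usage sketch may help here. Because frexp normalizes the fraction to [0.5, 1), the exponent it reports is one larger than the IEEE-754 unbiased exponent. The helper `pow2_floor` below is an illustrative name; it uses the standard ldexp function to rebuild a power of 2 and assumes a finite, non-zero argument.

```c
#include <math.h>

/* Largest power of 2 not exceeding |x|, built from frexp and ldexp.
   frexp returns m with 0.5 <= |m| < 1 and x == m * 2^e, so the IEEE
   unbiased exponent is e - 1. Assumes x is finite and non-zero. */
double pow2_floor(double x) {
    int e;
    (void)frexp(x, &e);
    return ldexp(1.0, e - 1);
}
```

For example, frexp(8.0, &e) returns 0.5 with e set to 4, and `pow2_floor(123.1)` is 64.0.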

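#include <string.h>

int get_exp(double *d) {
  long long l;
  memcpy(&l, d, sizeof l);   /* copy the bits; a pointer cast would violate strict aliasing */
  return (int)((l & (0x7ffLL << 52)) >> 52) - 1023;   /* remove the exponent bias */
}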

- Alain Merigot

1

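frexp does more work than needed here. Yet it does not do enough, because I need to clamp exp and then convert back to get a double. I don't think frexp is really usable in my case, since I need speed. If possible, I would rather extract the exponent manually (just a memcpy, a shift and a mask). - geza

@geza: frexp is just a shift and a mask, standardized and documented to make it portable. If you want to adjust the exponent, pair it with ldexp (note that ldexp adds to the exponent rather than replacing it). - Ben Voigt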

1

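@BenVoigt: Unfortunately, no. Look at the source code. It handles nan/inf and subnormals. It also performs some logic that could be part of my if (exp<-1022) .. logic. So with frexp I would have redundant code. I'm not saying it is very slow. But if this is the right approach, extracting the exponent manually is still better (for me). - geza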

Content provided by Stack Overflow.

- wim · Accepted Answer

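The function s = get_scale(z) computes the "close power of 2". Since the fraction bits of s are zero, the reciprocal of s takes just one (cheap) integer subtraction: see the function inv_of_scale.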
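The reason a single subtraction suffices can be spelled out on the bit pattern. If s = 2^e has biased exponent E = e + 1023 and zero fraction bits, then 1/s = 2^-e has biased exponent 2046 - E, and 2046 shifted into the exponent field is 0x7FE0000000000000. The sketch below shows this with memcpy instead of a union; the name `inv_pow2` is only illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Reciprocal of a power of 2, s = 2^e, with all fraction bits zero and
   biased exponent E in 1..2045. The bits of s are E << 52, so
   (2046 << 52) - (E << 52) = (2046 - E) << 52, the bits of 2^-e. */
double inv_pow2(double s) {
    uint64_t bits;
    memcpy(&bits, &s, sizeof bits);
    bits = 0x7FE0000000000000ull - bits;
    memcpy(&s, &bits, sizeof s);
    return s;
}
```

The result is only meaningful for inputs of that form; a value with non-zero fraction bits would not come out as its reciprocal.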

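On x86, get_scale and inv_of_scale compile to fairly efficient assembly with clang. clang translates the ternary operators to minsd and maxsd; see also Peter Cordes' comment. With gcc it is slightly more efficient to translate these functions to x86 intrinsics code (get_scale_x86 and inv_of_scale_x86); see Godbolt.

Note that C explicitly allows type punning through unions (wording), but C++ (c++11) does not. Although gcc 8.2 and clang 7.0 don't complain about the union, you can improve portability to C++ by using the memcpy trick instead of the union trick. Modifying the code should be straightforward. The code should handle subnormals correctly.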
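As an illustration of the memcpy trick, here is how get_scale might be rewritten without the union. This is a sketch rather than the answer's own code; the name `get_scale_memcpy` is hypothetical, the constants are those of get_scale, and compilers typically optimize such fixed-size memcpy calls into plain register moves.

```c
#include <stdint.h>
#include <string.h>

/* get_scale with memcpy-based type punning (valid in both C and C++). */
double get_scale_memcpy(double t) {
    const uint64_t i_min = 0x0010000000000000ull;   /* bits of 2^-1022 */
    const uint64_t i_max = 0x7FD0000000000000ull;   /* bits of 2^1022  */
    uint64_t i;
    double x, x_min, x_max;
    memcpy(&i, &t, sizeof i);
    i &= 0x7FF0000000000000ull;        /* clear sign and fraction bits */
    memcpy(&x, &i, sizeof x);
    memcpy(&x_min, &i_min, sizeof x_min);
    memcpy(&x_max, &i_max, sizeof x_max);
    x = (x < x_min) ? x_min : x;       /* subnormal or zero: use 2^-1022 */
    x = (x > x_max) ? x_max : x;       /* too large: use 2^1022          */
    return x;
}
```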

#include<stdio.h>
#include<stdint.h>
#include<immintrin.h>
/* gcc -Wall -m64 -O3 -march=sandybridge dbl_scale.c */

union dbl_int64{
    double d;
    uint64_t i;
};

double get_scale(double t){
    union dbl_int64 x;
    union dbl_int64 x_min;
    union dbl_int64 x_max;
    uint64_t mask_i;
           /* 0xFEDCBA9876543210 */
    x_min.i = 0x0010000000000000ull;
    x_max.i = 0x7FD0000000000000ull;
    mask_i =  0x7FF0000000000000ull;
    x.d = t;
    x.i = x.i & mask_i;                    /* Set fraction bits to zero, take absolute value */
    x.d = (x.d < x_min.d) ? x_min.d : x.d; /* If subnormal: set exponent to 1                */
    x.d = (x.d > x_max.d) ? x_max.d : x.d; /* If exponent is very large: set exponent to 7FD, otherwise the inverse is a subnormal */
    return x.d;
}

double get_scale_x86(double t){
    __m128d x = _mm_set_sd(t);
    __m128d x_min = _mm_castsi128_pd(_mm_set1_epi64x(0x0010000000000000ull));
    __m128d x_max = _mm_castsi128_pd(_mm_set1_epi64x(0x7FD0000000000000ull));
    __m128d mask  = _mm_castsi128_pd(_mm_set1_epi64x(0x7FF0000000000000ull));
            x     = _mm_and_pd(x, mask);
            x     = _mm_max_sd(x, x_min);
            x     = _mm_min_sd(x, x_max);
    return _mm_cvtsd_f64(x);
}

/* Compute the inverse 1/t of a double t with all zero fraction bits     */
/* and exponent between the limits of function get_scale                 */
/* A single integer subtraction is much less expensive than a            */
/* floating point division.                                               */
double inv_of_scale(double t){
    union dbl_int64 x;
                     /* 0xFEDCBA9876543210 */
    uint64_t inv_mask = 0x7FE0000000000000ull;
    x.d = t;
    x.i = inv_mask - x.i;
    return x.d;
}

double inv_of_scale_x86(double t){
    __m128i inv_mask = _mm_set1_epi64x(0x7FE0000000000000ull);
    __m128d x        = _mm_set_sd(t);
    __m128i x_i      = _mm_sub_epi64(inv_mask, _mm_castpd_si128(x));
    return _mm_cvtsd_f64(_mm_castsi128_pd(x_i));
}


int main(){
    int n = 14;
    int i;
    /* Several example values, 4.94e-324 is the smallest subnormal */
    double y[14] = { 4.94e-324, 1.1e-320,  1.1e-300,  1.1e-5,  0.7,  1.7,  123.1, 1.1e300,  
                     1.79e308, -1.1e-320,    -0.7, -1.7, -123.1,  -1.1e307};
    double z, s, u;

    printf("Portable code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale(z);
        u = inv_of_scale(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    printf("\nx86 specific SSE code:\n");
    printf("             x       pow_of_2        inverse       pow2*inv      x*inverse \n");
    for (i = 0; i < n; i++){  
        z = y[i];
        s = get_scale_x86(z);
        u = inv_of_scale_x86(s);
        printf("%14e %14e %14e %14e %14e\n", z, s, u, s*u, z*u);
    }

    return 0;
}

The output looks good:

Portable code:
             x       pow_of_2        inverse       pow2*inv      x*inverse 
 4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16
 1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13
 1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00
  1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00
  7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00
  1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00
  1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00
 1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00
 1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00
-1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13
 -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00
 -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00
 -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00
-1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00

x86 specific SSE code:
             x       pow_of_2        inverse       pow2*inv      x*inverse 
 4.940656e-324  2.225074e-308  4.494233e+307   1.000000e+00   2.220446e-16
 1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00   4.942713e-13
 1.100000e-300  7.466109e-301  1.339386e+300   1.000000e+00   1.473324e+00
  1.100000e-05   7.629395e-06   1.310720e+05   1.000000e+00   1.441792e+00
  7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00   1.400000e+00
  1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00   1.700000e+00
  1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00   1.923437e+00
 1.100000e+300  6.696929e+299  1.493222e-300   1.000000e+00   1.642544e+00
 1.790000e+308  4.494233e+307  2.225074e-308   1.000000e+00   3.982882e+00
-1.099790e-320  2.225074e-308  4.494233e+307   1.000000e+00  -4.942713e-13
 -7.000000e-01   5.000000e-01   2.000000e+00   1.000000e+00  -1.400000e+00
 -1.700000e+00   1.000000e+00   1.000000e+00   1.000000e+00  -1.700000e+00
 -1.231000e+02   6.400000e+01   1.562500e-02   1.000000e+00  -1.923437e+00
-1.100000e+307  5.617791e+306  1.780059e-307   1.000000e+00  -1.958065e+00

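Vectorization

If the compiler supports auto-vectorization, the function get_scale should vectorize. The following snippet vectorizes very well with clang (no need to write SSE/AVX intrinsics code).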

/* Test how well get_scale vectorizes: */
void get_scale_vec(double * __restrict__ t, double * __restrict__ x){
    int n = 1024;
    int i;
    for (i = 0; i < n; i++){
        x[i] = get_scale(t[i]);
    }
}

Unfortunately, gcc does not find the vmaxpd and vminpd instructions.
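One alternative that avoids floating-point min/max altogether is to clamp the bit patterns as integers, which works because non-negative doubles order the same way as their bit patterns. The loop below is a sketch of that idea with an invented name (`get_scale_vec_int`); whether a particular gcc version vectorizes it well has not been checked here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same result as get_scale, computed with integer compares only. */
void get_scale_vec_int(const double *restrict t, double *restrict x, size_t n) {
    const uint64_t mask  = 0x7FF0000000000000ull;   /* exponent field only */
    const uint64_t b_min = 0x0010000000000000ull;   /* bits of 2^-1022     */
    const uint64_t b_max = 0x7FD0000000000000ull;   /* bits of 2^1022      */
    for (size_t i = 0; i < n; i++) {
        uint64_t b;
        memcpy(&b, &t[i], sizeof b);
        b &= mask;                      /* clear sign and fraction bits */
        b = (b < b_min) ? b_min : b;    /* clamp from below */
        b = (b > b_max) ? b_max : b;    /* clamp from above */
        memcpy(&x[i], &b, sizeof b);
    }
}
```

Called on the array {123.1, -0.7, 0.0, 1.79e308}, it produces 64.0, 0.5, 2^-1022 and 2^1022.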