浮点数比较再探讨

Question

浮点数比较再探讨

c++floating-pointlanguage-lawyerieee-754

28

这个话题在StackOverflow上已经被提及多次，但我相信这是一个新的观点。是的，我阅读了Bruce Dawson的文章和计算机科学家应该知道的浮点运算知识以及这个不错的答案。

据我理解，在典型系统上，比较浮点数是否相等时存在四个基本问题：

浮点计算并不精确
a-b 是否“小”取决于 a 和 b 的规模
a-b 是否“小”取决于 a 和 b 的类型（例如：float、double、long double）
浮点通常具有 +-infinity、NaN 和非规格化表示，其中任何一种都可能干扰天真的公式

这个答案--也被称为“谷歌方法”--似乎很受欢迎。它可以处理所有棘手的情况。它确实非常精确地比较两个值是否相差ULPs个单位。因此，例如，一个非常大的数与无穷大“几乎相等”。

然而：

在我看来，它非常混乱。
它不是特别可移植，严重依赖于内部表示，使用联合从浮点数中读取位等。
它只处理单精度和双精度IEEE 754（特别是没有x86长双精度）

我想要类似的东西，但使用标准C++并处理长双精度。通过“标准”，我指的是如果可能的话C++03，如果必要的话C++11。

这是我的尝试。

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Local version of frexp() that handles infinities specially.
template<typename T>
T my_frexp(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    // Treat +-infinity as +-(2^max_exponent).
    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent + 1;
        return std::copysign(0.5, num);
    }
    else return std::frexp(num, exp);
}
}

template<typename T>
bool almostEqual(const T a, const T b, const unsigned ulps=4)
{
    // Handle NaN.
    if (std::isnan(a) || std::isnan(b))
        return false;

    typedef std::numeric_limits<T> limits;

    // Handle very small and exactly equal values.
    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    // frexp() does the wrong thing for zero.  But if we get this far
    // and either number is zero, then the other is too big, so just
    // handle that now.
    if (a == 0 || b == 0)
        return false;

    // Break the numbers into significand and exponent, sorting them by
    // exponent.
    int min_exp, max_exp;
    T min_frac = my_frexp(a, &min_exp);
    T max_frac = my_frexp(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    // Convert the smaller to the scale of the larger by adjusting its
    // significand.
    const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);

    // Since the significands are now in the same scale, and the larger
    // is in the range [0.5, 1), 1 ulp is just epsilon/2.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
}

我声称这段代码(a) 处理了所有相关的情况，(b) 对于IEEE-754单精度和双精度与谷歌实现的功能相同，(c) 完全符合C++标准。

这些声明中至少有一个是错误的。我将接受任何证明这一点的答案，最好附带修复方案。一个好的答案应该包括以下内容之一：

- 具体输入差异超过ulps个单位，但此函数返回true（差异越大越好） - 具体输入差异不超过ulps个单位，但此函数返回false（差异越小越好） - 我所错过的任何情况 - 代码依赖未定义行为或因实现定义的行为而破坏的任何方式。（如果可能，请引用相关规范。） - 您确定的任何问题的修复程序 - 在不破坏代码的情况下简化代码的任何方法

我打算在这个问题上放置一个非微不足道的悬赏。

- Nemo

1

http://floating-point-gui.de/ - Pavel Radzivilovsky

1

@Nemo：“Mantissa”在IEEE 754标准中没有出现。您指向的维基百科页面不是定义，而是消歧义页面。它所指的“有效数字”页面有一个关于“尾数”的用法说明。POSIX标准规定了POSIX；它对浮点数或数学并不具有权威性。 - Eric Postpischil

2

有趣，但这是一个真正的问题吗？ - hookenz

@n.m. - 浮点数运算并不代表实数计算，无论如何都无法改变这一点。你要么接受浮点数运算的工作方式，要么花费大量时间试图绕过一个有缺陷的心理模型。 - Pete Becker

使用浮点数进行计算时，其规则与实数不同。当然，这就是它们被称为不精确的原因。 - n. m.

显示剩余11条评论

4个回答

2

简化：你可以先完全忽略所有不是有限数的情况，从而避免使用my_frexp函数：

if( ! std::isfinite(a) || ! std::isfinite(b) )
    return a == b;

看起来isfinite至少在C++11中存在

编辑然而，如果意图是让limits::infinity()与limits::max()的1 ulp之内
那么上面的简化并不成立，但是我的my_frexp()函数应该返回*exp的limits::max_exponent+1，而不是max_exponent+2吗？

- aka.nice

// 非有限浮点数要么相等，要么不相等，而不是几乎相等 - aka.nice

提问者说，“例如，一个非常大的数字与无穷大相比‘几乎相等’”，并带有明显的赞同语气。 - Steve Jessop

@SteveJessop 嗯，好观点，但是 my_frexp 应该在 *exp 中返回 max_exponent + 1，而不是 +2，这让我误入歧途 :( - aka.nice

@Steve - 实际上，这个问题的精神是模仿（显然很受欢迎的）谷歌方法，但是要可移植。我会放弃完全模仿，以换取在可移植性、速度或可读性方面的重大改进...但如果我两者都能兼得，那就太好了。@aka.nice - max_exponent + 2 是正确的，因为 frexp 返回的分数部分在范围 [0.5, 1) 内，而不是 [1, 2)。 - Nemo

1

@Nemo 我明白，但是2^max_exponent仍然大于max()，所以为什么是2^(max_exponent+1)？它比max()的两倍还要多。 - aka.nice

@aka.nice：好观点（+1）。我脑海中对max_exponent的定义有误。已经修正，谢谢。 - Nemo

1

未来证明：如果您将来想要将此类比较扩展到十进制浮点数http://en.wikipedia.org/wiki/Decimal64_floating-point_format，并且假设ldexp()和frexp()将使用正确的基数处理此类类型，则严格来说，在return std::copysign(0.5, num);中的0.5应该被替换为T(1)/limits::radix() - 或者std::ldexp(T(1),-1)或其他什么...（我在std::numeric_limits中找不到方便的常量）

编辑正如Nemo所评论的那样，假设ldexp和frexp将使用正确的FLOAT_RADIX是错误的，它们将坚持2...

因此，未来证明的可移植版本也应使用：

使用std::scalbn(x,n)代替std::ldexp(x,n)
exp=std::ilogb(std::abs(x)),y=std::scalbn(x,-exp)代替y=frexp(x,&exp)
现在，上述y的范围是[1，FLOAT_RADIX)，而不是[T(1)/Float_Radix,1)，对于my_frexp无限情况，返回copysign(T(1),num)，并测试ulps*limits::epsilon()而不是ulps*epsilon()/2

这也需要C++11或更高版本的标准

- aka.nice

1

好观点。但要真正做到未来兼容，我需要使用C++11的std::ilogb和std::scalb来在正确的基数下完成所有操作。在我工作的地方，C++11不是一个选择，并且可能在这十年内都不会成为选项...然而，一个完全可移植的C++11实现似乎是可能的，也值得准备一下。想试着在这个答案中添加吗？ - Nemo

0

你的方法有两个缺点，一个是你的实现依赖于<cmath>中的几个函数，这些函数显然只适用于标准中找到的浮点数，而且目前不能保证是constexpr（虽然在GCC中，你的函数的constexpr版本可以编译，但大多数其他编译器不行）。

Google的方法比你的方法更有优势，因为在C++20中，你可以用std::bit_cast替换GoogleTest版本中的union（或reinterpret_cast），从而得到一个constexpr函数。此外，通过检查std::numeric_limits<T>::is_iec559并使用新引入的std::endian，你应该能够确保所假定的内部表示是正确的。通过引入两个自定义特性，你可以获得很多灵活性，从而将该函数扩展到非标准浮点类型。

用于确定等长无符号整数的一个特性（在我的代码中称为UIntEquiv）：通过添加专门化（例如template<> class UIntEquiv<16>）并使用GCC的__uint128_t或Boost Multiprecision，您可以将代码扩展到任意长度的浮点数。
另一个特性是实际的浮点类型（在我的代码中称为FloatTrait），包括其符号、尾数和指数掩码，以及表示无穷大和NaN的方式。这样，您可以通过添加另一个专门化来将比较扩展到非C++标准浮点数，如float256。

我在下方插入了来自我的Gist的代码:

#include <bit>
#include <concepts>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <limits>
#include <type_traits>
#include <utility>
#include <vector>


namespace detail {
  // Trait for excluding incomplete types: See https://dev59.com/qFcP5IYBdhLWcg3wi6gA#44229779
  template <typename T, std::size_t = sizeof(T)>
  consteval std::true_type is_complete(T*) noexcept;

  consteval std::false_type is_complete(...) noexcept;
}

template <typename T>
using is_complete = decltype(detail::is_complete(std::declval<T*>()));

template <typename T>
static constexpr bool is_complete_v = is_complete<T>::value;

namespace detail {

  // Class for determining the corresponding unsigned integer type with equal length to the floating type
  template <std::size_t N>
  class UIntEquiv {
    protected:
      UIntEquiv() = delete;
      UIntEquiv(UIntEquiv const&) = delete;
      UIntEquiv(UIntEquiv&&) = delete;
      UIntEquiv& operator= (UIntEquiv const&) = delete;
      UIntEquiv& operator= (UIntEquiv&&) = delete;

      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint8_t))>* = nullptr>
      static consteval std::uint8_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint16_t))>* = nullptr>
      static consteval std::uint16_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint32_t))>* = nullptr>
      static consteval std::uint32_t determineUIntType() noexcept;
      template<std::size_t M, typename std::enable_if_t<(M==sizeof(std::uint64_t))>* = nullptr>
      static consteval std::uint64_t determineUIntType() noexcept;

    public:
      using type = decltype(determineUIntType<N>());
  };
  
  // You can potentially add specialisation of UIntEquiv for longer unsigned integer types here (e.g. for long double support).
  // e.g. GCC's __uint128_t: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html
  //      or Boost: https://www.boost.org/doc/libs/1_67_0/libs/multiprecision/doc/html/boost_multiprecision/tut/ints/cpp_int.html
  //
  // template <>
  // class UIntEquiv<sizeof(__uint128_t)> {
  //   public:
  //     using type = __uint128_t;
  // };
  //
  // As long as std::numeric_limits<T> is specialized for the corresponding floating type and your architecture respects IEEE754 and stores
  // your floating point numbers with little-endian the code should compile correctly.
  // Therefore in case you have particular proprietary floating types with a different mantissa and exponent such as 
  // e.g. GCC's __float128: https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html
  // the fastest solution is probably to specialize the std::numeric_limits<T> trait yourself.
  // Boost should already provide the fully specialized traits: https://www.boost.org/doc/libs/1_65_1/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/float128.html

  template <std::size_t N>
  using UIntEquiv_t = typename UIntEquiv<N>::type;

  // In case your floating type does not respect IEEE754 or is not stored with little endian you will have to specialise the entire 
  // FloatTrait yourself:
  template <typename T>
  class FloatTrait;

  // Specialised trait for floating point number types according to IEEE754 stored with little endian
  template <typename T>
  requires std::is_floating_point_v<T> && std::numeric_limits<T>::is_iec559 && (std::endian::native == std::endian::little)
  class FloatTrait<T> {
    public:
      static constexpr std::size_t number_of_bytes {sizeof(T)};
      static constexpr std::size_t number_of_bits {number_of_bytes*std::numeric_limits<std::uint8_t>::digits};
      using Bytes = UIntEquiv_t<number_of_bytes>;

      static constexpr std::size_t number_of_sign_bits {1};
      static constexpr std::size_t number_of_fraction_bits {std::numeric_limits<T>::digits-1};
      static constexpr std::size_t number_of_exponent_bits {number_of_bits - number_of_sign_bits - number_of_fraction_bits};

      static constexpr Bytes sign_mask {Bytes{1} << (number_of_bits - 1)};
      static constexpr Bytes fraction_mask {~Bytes{0} >> (number_of_exponent_bits + 1)};
      static constexpr Bytes exponent_mask {~(sign_mask | fraction_mask)};

      static constexpr bool isNan(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes != 0);
      }

      static constexpr bool isPosInf(T const t) noexcept {
        return isPos(t) && isInf(t);
      }

      static constexpr bool isNegInf(T const t) noexcept {
        return isNeg(t) && isInf(t);
      }

      static constexpr bool isNeg(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes != 0;
      }

      // Optional helper functions
      static constexpr bool isPos(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const sign_bytes {extractSign(bytes)};
        return sign_bytes == 0;
      }

      static constexpr bool isInf(T const t) noexcept {
        auto const bytes {std::bit_cast<Bytes>(t)};
        auto const exponent_bytes {extractExponent(bytes)};
        auto const fraction_bytes {extractFraction(bytes)};
        return (exponent_bytes == exponent_mask) && (fraction_bytes == 0);
      }

      static constexpr Bytes extractSign(Bytes const bytes) noexcept {
        return bytes & sign_mask;
      }

      static constexpr Bytes extractExponent(Bytes const bytes) noexcept {
        return bytes & exponent_mask;
      }

      static constexpr Bytes extractFraction(Bytes const bytes) noexcept {
        return bytes & fraction_mask;
      }

    protected:
      FloatTrait() = delete;
      FloatTrait(FloatTrait const&) = delete;
      FloatTrait(FloatTrait&&) = delete;
      FloatTrait& operator= (FloatTrait const&) = delete;
      FloatTrait& operator= (FloatTrait&&) = delete;
  };

  template <typename T>
  requires is_complete_v<FloatTrait<T>>
  class FloatView {
    public:
      using Trait = FloatTrait<T>;
      using Bytes = typename FloatTrait<T>::Bytes;

      explicit constexpr FloatView(T const t) noexcept
        : value{t} {
        return;
      }
      FloatView() = default;
      FloatView(FloatView const&) = default;
      FloatView(FloatView&&) = default;
      FloatView& operator= (FloatView const&) = default;
      FloatView& operator= (FloatView&&) = default;

      constexpr bool isAlmostEqual(FloatView const rhs, std::uint8_t const max_distance = 4) const noexcept {
        if (Trait::isNan(value) || Trait::isNan(rhs.value)) {
          return false;
        } else if (Trait::isNegInf(value) != Trait::isNegInf(rhs.value)) {
          return false;
        } else if (Trait::isPosInf(value) != Trait::isPosInf(rhs.value)) {
          return false;
        }
        return computeDistance(value, rhs.value) <= max_distance;
      }

    protected:
      T value;

      static constexpr Bytes signMagnitudeToBiased(T const t) noexcept {
        auto const b {std::bit_cast<Bytes>(t)};
        if (Trait::isNeg(t)) {
          return ~b + Bytes{1};
        } else {
          return Trait::sign_mask | b;
        }
      }

      static constexpr Bytes computeDistance(T const a, T const b) noexcept {
        auto const biased1 = signMagnitudeToBiased(a);
        auto const biased2 = signMagnitudeToBiased(b);
        return (biased1 >= biased2) ? (biased1 - biased2) : (biased2 - biased1);
      }
  };

}

template <typename T>
constexpr bool isAlmostEqual(T const lhs, T const rhs, std::uint8_t const max_distance = 4) noexcept {
  detail::FloatView<T> const a {lhs};
  detail::FloatView<T> const b {rhs};
  return a.isAlmostEqual(b, max_distance);
}

在这里试试吧！

- 2b-t

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Eric Postpischil · Accepted Answer

“Almost Equals”不是一个好函数

4并不是合适的默认值：你所指向的答案声称“因此，对于普通用途来说，4应该足够了”，但却没有提供任何依据。实际上，在浮点数通过不同方法计算时，即使它们在精确计算中相等，它们之间可能会相差很多ULP，这种情况在普通情况下也会发生。因此，容差不能有默认值；每个用户都应该根据对其代码进行彻底分析的基础上自行设定。

以1./49*49-1为例，数学上的确切结果为0，但计算出的结果（64位IEEE 754二进制）为2⁻⁵³，误差超过10³⁰⁷ ULP的确切结果，并且几乎为计算结果的10¹⁶ ULP。

有时候，不存在合适的值：在某些情况下，容差不能相对于所比较的值，无论是数学上确切的相对容差还是量化的ULP容差。例如，在FFT中，几乎每个输出值都受到几乎每个输入值的影响，任何一个元素的误差都与其他元素的大小有关。因此，我们可能会得到的误差量无法从所比较的两个值计算出来；它是FFT中其他数据的函数。必须向“几乎相等”例程提供有关潜在误差的其他上下文信息。

“Almost Equals”具有较差的数学特性：这表明了“几乎相等”的一些缺点：缩放会改变结果。以下代码打印1和0，说明1.1和1.1 + 3*0x1p-52是“几乎相等”的，但是乘以.8的同样值并不是“几乎相等”的。

double x0 = 1.1;
double x1 = 1.1 + 3*0x1p-52;
std::cout << almostEqual(x0, x1) << "\n";
x0 *= .8;
x1 *= .8;
std::cout << almostEqual(x0, x1) << "\n";

该算法的另一个失败之处在于它不是传递的；almostEqual(a, b)和almostEqual(b, c)并不能推出almostEqual(a, c)。

极端情况下的错误

almostEqual(1.f, 1.f/11, 0x745d17)返回错误的结果 1。

1.f/11 等于 0x1.745d18p-4（十六进制浮点数表示法，意味着1.745d18₁₆•2⁻⁴）。将此值从 1 减去（即 0x10p-4），得到 0xe.8ba2e8p-4。由于 ULP 值为 1 时等于 0x1p-23，因此 0xe.8ba2e8p19 ULP 等于 0xe8ba2e.8/2 ULP（向左移动 20 位并除以 2，共 19 位）= 0x745d17.4 ULP。这超过了指定的公差 0x745d17，因此正确答案应为 0。

这个错误是由于在 max_frac-scaled_min_frac 中进行了四舍五入造成的。

一个简单的解决方法是规定 ulps 必须小于 .5/limits::epsilon。然后，仅当差异（即使经过舍入）超过 ulps 时，才会在 max_frac-scaled_min_frac 中进行舍入；如果差异小于该值，则减法是精确的，根据 Sterbenz 引理。

有一个建议使用 long double 来纠正这个问题。但是，long double 不会纠正这个问题。考虑将 1 和 -0x1p-149f 与 ulps 设置为 1/limits::epsilon 进行比较。除非您的尾数有 149 位，否则减法结果舍入为 1，小于或等于 1/limits::epsilon ULP。然而，数学上的差异显然超过了 1。

小注释

表达式 factor * limits::epsilon / 2 将 factor 转换为浮点类型，这会导致在无法准确表示大值的情况下出现舍入误差。很可能，该例程并不适用于这样的大值（在 float 中数百万个 ULP），因此应该将其规定为该例程的限制而不是错误。