C编程中的定点算术

39

我正在尝试创建一个存储股票价格的应用程序,并保证高精度。目前我正在使用double来实现这一点,为了节省内存,我可以使用其他数据类型吗?我知道这与定点算术有关,但我无法理解。


2
https://dev59.com/pnVD5IYBdhLWcg3wHn2d - Corbin
5
另一个问题是关于C++,而不是C。在C中编写类将行不通。 - Jonathan Leffler
1
你可能会在这里找到一些关于C语言的有用信息:为什么C99没有包括定点类型 - Jonathan Leffler
你必须在精度和内存消耗之间做出良好的权衡...实际上,如果你正在处理股票,甚至可能需要比double类型更多的内存。 - UmNyobe
啊,抱歉。我应该更好地阅读问题。 - Corbin
1
需要表示的数字范围是多少?据我所知,大多数股票交易所已经停止使用分数(有理数)。我认为他们已经转向100分之一(两位小数)。您是否需要比这更高的精度?主要是存储还是程序需要进行大量的数学计算?如果需要进行数学计算,必须是中缀运算符(+,-,*,/)还是函数(add(),subtract()...)可以使用? - gbulmer
4个回答

98

定点算术背后的思想是将值乘以一定数量存储,在所有计算中使用乘以后的值,并在需要结果时除以相同的数量。这种技术的目的是使用整数算术(int,long ...),同时能够表示分数。

C语言中通常且最有效的做法是使用位移运算符(<<和>>)。位移是一种对ALU来说非常简单和快速的操作,进行这种操作具有将整数值乘以2(<<)和除以2(>>)的属性(此外,许多位移可以以单个位移的价格完成)。当然,缺点是乘数必须是2的幂次方(这通常本身不是问题,因为我们并不真正关心确切的乘数值)。

现在假设我们要使用32位整数存储我们的值。我们必须选择2的幂次方乘数。让我们将蛋糕分成两半,比如说65536(这是最常见的情况,但您可以根据精度需求使用任何2的幂次方)。这是2的16次方,这里的16意味着我们将使用16个最低有效位(LSB)表示小数部分。其余的32-16 = 16位用于最高有效位(MSB),即整数部分。

     integer (MSB)    fraction (LSB)
           v                 v
    0000000000000000.0000000000000000

让我们用代码来表达:

#define SHIFT_AMOUNT 16 // 2^16 = 65536
#define SHIFT_MASK ((1 << SHIFT_AMOUNT) - 1) // 65535 (all LSB set, all MSB clear)

int price = 500 << SHIFT_AMOUNT;

这是您必须放入存储(结构,数据库或其他)中的值。请注意,即使在C语言中int通常为32位,但它不一定是32位。默认情况下,它是带符号的,可以在声明中添加unsigned以确保它是无符号的。更好的方法是,如果您的代码高度依赖于整数位大小(您可能会介绍一些黑客技巧),则可以使用stdint.h中声明的uint32_t或uint_least32_t。如果不确定,使用typedef为您的定点类型,这样更安全。

当您想要对这个值进行计算时,可以使用四个基本运算符:+,-,*和/。您必须记住,在加和减一个值(+和-)时,也必须将该值移位。假设我们要将10加到我们的500价格中:

price += 10 << SHIFT_AMOUNT;

但是对于乘法和除法(*和/),乘数/除数不能移位。假设我们想要乘以3:

price *= 3;

现在让我们更有趣一些,将价格除以4,这样我们就可以弥补非零小数部分:

price /= 4; // now our price is ((500 + 10) * 3) / 4 = 382.5

这就是所有的规则。如果你想在任何时候检索真实价格,你必须进行右移:

printf("price integer is %d\n", price >> SHIFT_AMOUNT);

如果您需要小数部分,您必须屏蔽它:

printf ("price fraction is %d\n", price & SHIFT_MASK);
当然,这个值不是我们所说的十进制小数,实际上它是范围在[0-65535]之间的整数。但它与十进制小数范围[0-0.9999 ...]完全匹配。换句话说,映射看起来像:0 => 0,32768 => 0.5,65535 => 0.9999 ...。
将其视为十进制小数的简单方法是在此时使用C内置的浮点算术:
printf("price fraction in decimal is %f\n", ((double)(price & SHIFT_MASK) / (1 << SHIFT_AMOUNT)));

但如果您没有FPU支持(无论是硬件还是软件),您可以像这样使用您的新技能以获取完整的价值:

printf("price is roughly %d.%lld\n", price >> SHIFT_AMOUNT, (long long)(price & SHIFT_MASK) * 100000 / (1 << SHIFT_AMOUNT));

表达式中0的数量大致对应于小数点后你希望有的数字位数。不要高估0的数量,考虑到你的分数精度(没有真正的陷阱,在很明显)。不要使用simple long,因为sizeof(long)可能等于sizeof(int)。如果int是32位,则使用long long,因为long long保证至少64位(或使用stdint.h中声明的int64_t、int_least64_t等)。换句话说,使用两倍于你固定点类型大小的类型就可以了。最后,如果你没有访问>= 64位类型的权限,也许现在是时候练习模拟它们了,至少对于你的输出来说是这样。

这些是固定点算术背后的基本思想。

注意负值。有时候会变得棘手,特别是当显示最终值的时候。此外,C对有符号整数是实现定义的(即使这是一个问题的平台现在非常不常见)。你应该始终在你的环境中进行最小测试,以确保一切都按预期进行。如果不是这样,如果你知道你在做什么,你可以通过它进行黑客攻击(我不会详细解释,但这与算术移位与逻辑移位以及二进制补码表示有关)。然而,对于无符号整数,无论你做什么,你大多是安全的,因为行为总是良好定义的。

还要注意,如果32位整数不能表示大于232-1的值,则使用216限制的固定点算术会将您的范围限制为216-1!(在有符号整数中将所有这些除以2,这样,在我们的例子中,我们将留下一个可用范围为215-1)。目标是选择适合情况的SHIFT_AMOUNT。这是整数部分幅度和小数部分精度之间的权衡。

现在到真正的警告:这种技术绝对不适用于精度是头等大事的领域(金融、科学、军事...)。通常的浮点数(float/double)也经常不够精确,尽管它们比固定点具有更好的性质。固定点的精度与值无关(在某些情况下,这可能是优势),而浮点数的精度与值的大小成反比(即,数值越低,精度越高...好吧,这比那更复杂,但你明白我的意思)。此外,相当于(按位数计算)整数(固定点或非固定点)的浮点数具有更大的幅度,代价是在高值时失去精度(甚至可以达到一个幅度的位置,在该位置添加1甚至更大的值将不会产生任何效果,这是整数无法发生的事情)。

如果你在这些敏感领域工作,最好使用专门用于任意精度目的的库(可以去看一下gmplib,它是免费的)。在计算机科学中,获得精度的本质是使用来存储值的位数。想要高精度?就用位。仅此而已。


2
我还建议使用fixedptc库sqlite4 decimal实现。源代码在source tree的math.c文件中。 - Bernardo Ramos
不要采用“使用浮点数”的丑陋方法,而是可以直接通过 frac / (1 << shiftamount) 缩放分数。当然,这种除法是不可能的,但诀窍在于首先将 frac 乘以最大十进制值(即99999),然后再除以二进制表示的幂次大小。为了避免溢出,您可以将数字转换为 int64。 - Martin
@Alex,非常好的回答!我决定根据您的信息编写完整的代码教程/示例,甚至将其推向更远。在我的回答中,我还演示了基于整数的“浮点”舍入,以及对“大整数”进行的分数定点数学(即:在数学运算期间不允许使用较大的类型,否则会导致溢出)。对于任何感兴趣的人,请参见我的答案:https://dev59.com/J2kw5IYBdhLWcg3wMHum#53936802。 - Gabriel Staples
这是一个相当不错的答案,但我有两个小问题:(1) 固定点并不一定要以二进制、按位的方式来完成——在金融应用中,通常使用像100或10000这样的十进制值作为分母;(2) 这并没有真正解决关于节省内存的核心问题——为了获得精度和范围,他们可能需要一个64位整数类型,这与大多数实现中的double类型大小相同。 - Adrian McCarthy
1
为什么要将小数部分乘以10000? - Vanush Vee
显示剩余4条评论

5
我认为有两种选择。如果您正在金融服务行业工作,那么您的代码应符合精度和准确性的标准,所以无论内存成本如何,您都需要遵循这些标准。我知道这个行业通常资金充足,所以购买更多的内存不应该是问题。 :)
如果此代码是用于个人使用,那么为了最大限度地提高精度,建议您在存储前将所有价格乘以固定因子并使用整数进行计算。例如,如果您想要达到每分钱的精度(可能还不够好),请将所有价格乘以100,以便您的单位实际上是美分而不是美元,并从那里开始。如果您想要更高的精度,请乘以更多。例如,要准确到一分钱(我听说经常应用的标准),请将价格乘以10000(100 * 100)。
现在,对于32位整数,乘以10000会使大量美元金额无法表示。实际的32位限制为20亿美元,这意味着仅能表示高达20000美元的价格:2000000000/10000 = 20000。如果将20000乘以某些值,则可能没有空间来保存结果。因此,我建议使用64位整数(long long)。即使将所有价格乘以10000,仍然有足够的空间容纳大的值,甚至在乘法中也是如此。
固定点的诀窍在于,每当进行计算时,您需要记住每个值实际上是乘以一个常数的基础值。在加法或减法之前,您需要使用较小的常数乘以值,以匹配具有较大常数的值。在乘法之后,您需要除以某个值,以将结果恢复为所需常数的乘积。如果您使用的常数不是2的幂,则必须进行整数除法,这在时间上是昂贵的。许多人使用2的幂作为它们的常数,因此它们可以移位而不是分割。
如果所有这些看起来很复杂,那就是这样。我认为最简单的方法是使用双精度浮点数,并在需要时购买更多的RAM。它们具有53位精度,约为9千万亿,或近16个十进制数字。是的,当您处理数十亿美元时,可能仍会失去一些美分,但如果您关心这一点,那么您并不是以正确的方式成为亿万富翁。 :)

5

@Alex在这里给出了一个很棒的答案。然而,我想补充一些改进,例如演示如何对任意小数位数进行模拟浮点(使用整数来像浮点数一样运算)舍入。我在下面的代码中演示了这一点。不过我做得更多,最终编写了一个完整的代码教程来学习定点数学。以下是详细内容:

我的定点数学教程:这是一个类似教程的实践代码,用于学习如何进行定点数学、仅使用整数进行“浮点”式打印、像浮点数般的整数舍入以及大整数上的分数定点数学。

如果你真的想学习定点数学,我认为这是有价值的代码需要仔细阅读,但我花了整个周末才写完,所以期望你可能需要几个小时才能彻底阅读所有内容。然而,舍入的基础知识可以在顶部部分找到,并且只需几分钟就可以学会。

我的GitHub上完整的代码:https://github.com/ElectricRCAircraftGuy/fixed_point_math

或者,以下是截断的代码(因为Stack Overflow不允许那么多字符):

/*
fixed_point_math tutorial
- A tutorial-like practice code to learn how to do fixed-point math, manual "float"-like prints using integers only,
  "float"-like integer rounding, and fractional fixed-point math on large integers. 

By Gabriel Staples
www.ElectricRCAircraftGuy.com
- email available via the Contact Me link at the top of my website.
Started: 22 Dec. 2018 
Updated: 25 Dec. 2018 

References:
- https://dev59.com/J2kw5IYBdhLWcg3wMHum

Commands to Compile & Run:
As a C program (the file must NOT have a C++ file extension or it will be automatically compiled as C++, so we will
make a copy of it and change the file extension to .c first):
See here: https://dev59.com/pnA75IYBdhLWcg3wqK10#3206195. 
    cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c fixed_point_math_copy.c && ./bin/fixed_point_math_c
As a C++ program:
    g++ -Wall -o ./bin/fixed_point_math_cpp fixed_point_math.cpp && ./bin/fixed_point_math_cpp

*/

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

// Define our fixed point type.
typedef uint32_t fixed_point_t;

#define BITS_PER_BYTE 8

#define FRACTION_BITS 16 // 1 << 16 = 2^16 = 65536
#define FRACTION_DIVISOR (1 << FRACTION_BITS)
#define FRACTION_MASK (FRACTION_DIVISOR - 1) // 65535 (all LSB set, all MSB clear)

// // Conversions [NEVERMIND, LET'S DO THIS MANUALLY INSTEAD OF USING THESE MACROS TO HELP ENGRAIN IT IN US BETTER]:
// #define INT_2_FIXED_PT_NUM(num)     (num << FRACTION_BITS)      // Regular integer number to fixed point number
// #define FIXED_PT_NUM_2_INT(fp_num)  (fp_num >> FRACTION_BITS)   // Fixed point number back to regular integer number

// Private function prototypes:
static void print_if_error_introduced(uint8_t num_digits_after_decimal);

int main(int argc, char * argv[])
{
    printf("Begin.\n");

    // We know how many bits we will use for the fraction, but how many bits are remaining for the whole number, 
    // and what's the whole number's max range? Let's calculate it.
    const uint8_t WHOLE_NUM_BITS = sizeof(fixed_point_t)*BITS_PER_BYTE - FRACTION_BITS;
    const fixed_point_t MAX_WHOLE_NUM = (1 << WHOLE_NUM_BITS) - 1;
    printf("fraction bits = %u.\n", FRACTION_BITS);
    printf("whole number bits = %u.\n", WHOLE_NUM_BITS);
    printf("max whole number = %u.\n\n", MAX_WHOLE_NUM);

    // Create a variable called `price`, and let's do some fixed point math on it.
    const fixed_point_t PRICE_ORIGINAL = 503;
    fixed_point_t price = PRICE_ORIGINAL << FRACTION_BITS;
    price += 10 << FRACTION_BITS;
    price *= 3;
    price /= 7; // now our price is ((503 + 10)*3/7) = 219.857142857.

    printf("price as a true double is %3.9f.\n", ((double)PRICE_ORIGINAL + 10)*3/7);
    printf("price as integer is %u.\n", price >> FRACTION_BITS);
    printf("price fractional part is %u (of %u).\n", price & FRACTION_MASK, FRACTION_DIVISOR);
    printf("price fractional part as decimal is %f (%u/%u).\n", (double)(price & FRACTION_MASK) / FRACTION_DIVISOR,
           price & FRACTION_MASK, FRACTION_DIVISOR);

    // Now, if you don't have float support (neither in hardware via a Floating Point Unit [FPU], nor in software
    // via built-in floating point math libraries as part of your processor's C implementation), then you may have
    // to manually print the whole number and fractional number parts separately as follows. Look for the patterns.
    // Be sure to make note of the following 2 points:
    // - 1) the digits after the decimal are determined by the multiplier: 
    //     0 digits: * 10^0 ==> * 1         <== 0 zeros
    //     1 digit : * 10^1 ==> * 10        <== 1 zero
    //     2 digits: * 10^2 ==> * 100       <== 2 zeros
    //     3 digits: * 10^3 ==> * 1000      <== 3 zeros
    //     4 digits: * 10^4 ==> * 10000     <== 4 zeros
    //     5 digits: * 10^5 ==> * 100000    <== 5 zeros
    // - 2) Be sure to use the proper printf format statement to enforce the proper number of leading zeros in front of
    //   the fractional part of the number. ie: refer to the "%01", "%02", "%03", etc. below.
    // Manual "floats":
    // 0 digits after the decimal
    printf("price (manual float, 0 digits after decimal) is %u.", 
           price >> FRACTION_BITS); print_if_error_introduced(0);
    // 1 digit after the decimal
    printf("price (manual float, 1 digit  after decimal) is %u.%01lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10 / FRACTION_DIVISOR); 
    print_if_error_introduced(1);
    // 2 digits after decimal
    printf("price (manual float, 2 digits after decimal) is %u.%02lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100 / FRACTION_DIVISOR); 
    print_if_error_introduced(2);
    // 3 digits after decimal
    printf("price (manual float, 3 digits after decimal) is %u.%03lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000 / FRACTION_DIVISOR); 
    print_if_error_introduced(3);
    // 4 digits after decimal
    printf("price (manual float, 4 digits after decimal) is %u.%04lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10000 / FRACTION_DIVISOR); 
    print_if_error_introduced(4);
    // 5 digits after decimal
    printf("price (manual float, 5 digits after decimal) is %u.%05lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100000 / FRACTION_DIVISOR); 
    print_if_error_introduced(5);
    // 6 digits after decimal
    printf("price (manual float, 6 digits after decimal) is %u.%06lu.", 
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000000 / FRACTION_DIVISOR); 
    print_if_error_introduced(6);
    printf("\n");


    // Manual "floats" ***with rounding now***:
    // - To do rounding with integers, the concept is best understood by examples: 
    // BASE 10 CONCEPT:
    // 1. To round to the nearest whole number: 
    //    Add 1/2 to the number, then let it be truncated since it is an integer. 
    //    Examples:
    //      1.5 + 1/2 = 1.5 + 0.5 = 2.0. Truncate it to 2. Good!
    //      1.99 + 0.5 = 2.49. Truncate it to 2. Good!
    //      1.49 + 0.5 = 1.99. Truncate it to 1. Good!
    // 2. To round to the nearest tenth place:
    //    Multiply by 10 (this is equivalent to doing a single base-10 left-shift), then add 1/2, then let 
    //    it be truncated since it is an integer, then divide by 10 (this is a base-10 right-shift).
    //    Example:
    //      1.57 x 10 + 1/2 = 15.7 + 0.5 = 16.2. Truncate to 16. Divide by 10 --> 1.6. Good.
    // 3. To round to the nearest hundredth place:
    //    Multiply by 100 (base-10 left-shift 2 places), add 1/2, truncate, divide by 100 (base-10 
    //    right-shift 2 places).
    //    Example:
    //      1.579 x 100 + 1/2 = 157.9 + 0.5 = 158.4. Truncate to 158. Divide by 100 --> 1.58. Good.
    //
    // BASE 2 CONCEPT:
    // - We are dealing with fractional numbers stored in base-2 binary bits, however, and we have already 
    //   left-shifted by FRACTION_BITS (num << FRACTION_BITS) when we converted our numbers to fixed-point 
    //   numbers. Therefore, *all we have to do* is add the proper value, and we get the same effect when we 
    //   right-shift by FRACTION_BITS (num >> FRACTION_BITS) in our conversion back from fixed-point to regular
    //   numbers. Here's what that looks like for us:
    // - Note: "addend" = "a number that is added to another".
    //   (see https://www.google.com/search?q=addend&oq=addend&aqs=chrome.0.0l6.1290j0j7&sourceid=chrome&ie=UTF-8).
    // - Rounding to 0 digits means simply rounding to the nearest whole number.
    // Round to:        Addends:
    // 0 digits: add 5/10 * FRACTION_DIVISOR       ==> + FRACTION_DIVISOR/2
    // 1 digits: add 5/100 * FRACTION_DIVISOR      ==> + FRACTION_DIVISOR/20
    // 2 digits: add 5/1000 * FRACTION_DIVISOR     ==> + FRACTION_DIVISOR/200
    // 3 digits: add 5/10000 * FRACTION_DIVISOR    ==> + FRACTION_DIVISOR/2000
    // 4 digits: add 5/100000 * FRACTION_DIVISOR   ==> + FRACTION_DIVISOR/20000
    // 5 digits: add 5/1000000 * FRACTION_DIVISOR  ==> + FRACTION_DIVISOR/200000
    // 6 digits: add 5/10000000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2000000
    // etc.

    printf("WITH MANUAL INTEGER-BASED ROUNDING:\n");

    // Calculate addends used for rounding (see definition of "addend" above).
    fixed_point_t addend0 = FRACTION_DIVISOR/2;
    fixed_point_t addend1 = FRACTION_DIVISOR/20;
    fixed_point_t addend2 = FRACTION_DIVISOR/200;
    fixed_point_t addend3 = FRACTION_DIVISOR/2000;
    fixed_point_t addend4 = FRACTION_DIVISOR/20000;
    fixed_point_t addend5 = FRACTION_DIVISOR/200000;

    // Print addends used for rounding.
    printf("addend0 = %u.\n", addend0);
    printf("addend1 = %u.\n", addend1);
    printf("addend2 = %u.\n", addend2);
    printf("addend3 = %u.\n", addend3);
    printf("addend4 = %u.\n", addend4);
    printf("addend5 = %u.\n", addend5);

    // Calculate rounded prices
    fixed_point_t price_rounded0 = price + addend0; // round to 0 decimal digits
    fixed_point_t price_rounded1 = price + addend1; // round to 1 decimal digits
    fixed_point_t price_rounded2 = price + addend2; // round to 2 decimal digits
    fixed_point_t price_rounded3 = price + addend3; // round to 3 decimal digits
    fixed_point_t price_rounded4 = price + addend4; // round to 4 decimal digits
    fixed_point_t price_rounded5 = price + addend5; // round to 5 decimal digits

    // Print manually rounded prices of manually-printed fixed point integers as though they were "floats".
    printf("rounded price (manual float, rounded to 0 digits after decimal) is %u.\n", 
           price_rounded0 >> FRACTION_BITS); 
    printf("rounded price (manual float, rounded to 1 digit  after decimal) is %u.%01lu.\n", 
           price_rounded1 >> FRACTION_BITS, (uint64_t)(price_rounded1 & FRACTION_MASK) * 10 / FRACTION_DIVISOR); 
    printf("rounded price (manual float, rounded to 2 digits after decimal) is %u.%02lu.\n", 
           price_rounded2 >> FRACTION_BITS, (uint64_t)(price_rounded2 & FRACTION_MASK) * 100 / FRACTION_DIVISOR); 
    printf("rounded price (manual float, rounded to 3 digits after decimal) is %u.%03lu.\n", 
           price_rounded3 >> FRACTION_BITS, (uint64_t)(price_rounded3 & FRACTION_MASK) * 1000 / FRACTION_DIVISOR); 
    printf("rounded price (manual float, rounded to 4 digits after decimal) is %u.%04lu.\n", 
           price_rounded4 >> FRACTION_BITS, (uint64_t)(price_rounded4 & FRACTION_MASK) * 10000 / FRACTION_DIVISOR); 
    printf("rounded price (manual float, rounded to 5 digits after decimal) is %u.%05lu.\n", 
           price_rounded5 >> FRACTION_BITS, (uint64_t)(price_rounded5 & FRACTION_MASK) * 100000 / FRACTION_DIVISOR); 


    // =================================================================================================================

    printf("\nRELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:\n");

    // RELATED CONCEPTS:
    // Now let's practice handling (doing math on) large integers (ie: large relative to their integer type),
    // withOUT resorting to using larger integer types (because they may not exist for our target processor), 
    // and withOUT using floating point math, since that might also either not exist for our processor, or be too
    // slow or program-space-intensive for our application.
    // - These concepts are especially useful when you hit the limits of your architecture's integer types: ex: 
    //   if you have a uint64_t nanosecond timestamp that is really large, and you need to multiply it by a fraction
    //   to convert it, but you don't have uint128_t types available to you to multiply by the numerator before 
    //   dividing by the denominator. What do you do?
    // - We can use fixed-point math to achieve desired results. Let's look at various approaches.
    // - Let's say my goal is to multiply a number by a fraction < 1 withOUT it ever growing into a larger type.
    // - Essentially we want to multiply some really large number (near its range limit for its integer type)
    //   by some_number/some_larger_number (ie: a fraction < 1). The problem is that if we multiply by the numerator
    //   first, it will overflow, and if we divide by the denominator first we will lose resolution via bits 
    //   right-shifting out.
    // Here are various examples and approaches.

    // -----------------------------------------------------
    // EXAMPLE 1
    // Goal: Use only 16-bit values & math to find 65401 * 16/127.
    // Result: Great! All 3 approaches work, with the 3rd being the best. To learn the techniques required for the 
    // absolute best approach of all, take a look at the 8th approach in Example 2 below.
    // -----------------------------------------------------
    uint16_t num16 = 65401; // 1111 1111 0111 1001 
    uint16_t times = 16;
    uint16_t divide = 127;
    
    printf("\nEXAMPLE 1\n");

    // Find the true answer.
    // First, let's cheat to know the right answer by letting it grow into a larger type. 
    // Multiply *first* (before doing the divide) to avoid losing resolution.
    printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);
    
    // 1st approach: just divide first to prevent overflow, and lose precision right from the start.
    uint16_t num16_result = num16/divide * times;
    printf("1st approach (divide then multiply):\n");
    printf("  num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);
    
    // 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers, 
    // placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
    // into when multiplying. Then, multiply and divide each part separately. 
    // - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you 
    //   do the division, since there's no bits to the right for the right-shifted bits during division to 
    //   be retained in.
    // Re-sum both sub-numbers at the end to get the final result. 
    // - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
    //   Therefore, any *times* value larger than this will cause overflow.
    uint16_t num16_upper8 = num16 >> 8; // 1111 1111
    uint16_t num16_lower8 = num16 & 0xFF; // 0111 1001
    num16_upper8 *= times;
    num16_lower8 *= times;
    num16_upper8 /= divide;
    num16_lower8 /= divide;
    num16_result = (num16_upper8 << 8) + num16_lower8;
    printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
    printf("  num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
    
    // 3rd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers, 
    // placing all 8 bits of each sub-number ***in the center***, with 4 bits on the left to grow when 
    // multiplying and 4 bits on the right to not lose as many bits when dividing. 
    // This will help stop the loss of resolution when we divide, at the cost of overflowing more easily when we 
    // multiply.
    // - NOTE THAT 16 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,1111,1111,0000 = 65536/4080 = 16.0627.
    //   Therefore, any *times* value larger than this will cause overflow.
    num16_upper8 = (num16 >> 4) & 0x0FF0;
    num16_lower8 = (num16 << 4) & 0x0FF0;
    num16_upper8 *= times;
    num16_lower8 *= times;
    num16_upper8 /= divide;
    num16_lower8 /= divide;
    num16_result = (num16_upper8 << 4) + (num16_lower8 >> 4);
    printf("3rd approach (split into 2 8-bit sub-numbers with bits centered):\n");
    printf("  num16_result = %u. <== Perfect! Retains the bits that right-shift during the divide.\n", num16_result);

    // -----------------------------------------------------
    // EXAMPLE 2
    // Goal: Use only 16-bit values & math to find 65401 * 99/127.
    // Result: Many approaches work, so long as enough bits exist to the left to not allow overflow during the 
    // multiply. The best approach is the 8th one, however, which 1) right-shifts the minimum possible before the
    // multiply, in order to retain as much resolution as possible, and 2) does integer rounding during the divide
    // in order to be as accurate as possible. This is the best approach to use.
    // -----------------------------------------------------
    num16 = 65401; // 1111 1111 0111 1001 
    times = 99;
    divide = 127;

    printf("\nEXAMPLE 2\n");

    // Find the true answer by letting it grow into a larger type.
    printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);

    // 1st approach: just divide first to prevent overflow, and lose precision right from the start.
    num16_result = num16/divide * times;
    printf("1st approach (divide then multiply):\n");
    printf("  num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);

    // 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers, 
    // placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
    // into when multiplying. Then, multiply and divide each part separately. 
    // - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you 
    //   do the division, since there's no bits to the right for the right-shifted bits during division to 
    //   be retained in.
    // Re-sum both sub-numbers at the end to get the final result. 
    // - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
    //   Therefore, any *times* value larger than this will cause overflow.
    num16_upper8 = num16 >> 8; // 1111 1111
    num16_lower8 = num16 & 0xFF; // 0111 1001
    num16_upper8 *= times;
    num16_lower8 *= times;
    num16_upper8 /= divide;
    num16_lower8 /= divide;
    num16_result = (num16_upper8 << 8) + num16_lower8;
    printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
    printf("  num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
    
    /////////////////////////////////////////////////////////////////////////////////////////////////
    // TRUNCATED BECAUSE STACK OVERFLOW WON'T ALLOW THIS MANY CHARACTERS.
    // See the rest of the code on github: https://github.com/ElectricRCAircraftGuy/fixed_point_math
    /////////////////////////////////////////////////////////////////////////////////////////////////

    return 0;
} // main

// PRIVATE FUNCTION DEFINITIONS:

/// @brief A function to help identify at what decimal digit error is introduced, based on how many bits you are using
///        to represent the fractional portion of the number in your fixed-point number system.
/// @details    Note: this function relies on an internal static bool to keep track of if it has already
///             identified at what decimal digit error is introduced, so once it prints this fact once, it will never 
///             print again. This is by design just to simplify usage in this demo.
/// @param[in]  num_digits_after_decimal    The number of decimal digits we are printing after the decimal 
///             (0, 1, 2, 3, etc)
/// @return     None
static void print_if_error_introduced(uint8_t num_digits_after_decimal)
{
    static bool already_found = false;

    // Array of power base 10 values, where the value = 10^index:
    const uint32_t POW_BASE_10[] = 
    {
        1, // index 0 (10^0)
        10, 
        100, 
        1000, 
        10000, 
        100000,
        1000000,
        10000000,
        100000000,
        1000000000, // index 9 (10^9); 1 Billion: the max power of 10 that can be stored in a uint32_t
    };

    if (already_found == true)
    {
        goto done;
    }

    if (POW_BASE_10[num_digits_after_decimal] > FRACTION_DIVISOR)
    {
        already_found = true;
        printf(" <== Fixed-point math decimal error first\n"
               "    starts to get introduced here since the fixed point resolution (1/%u) now has lower resolution\n"
               "    than the base-10 resolution (which is 1/%u) at this decimal place. Decimal error may not show\n"
               "    up at this decimal location, per say, but definitely will for all decimal places hereafter.", 
               FRACTION_DIVISOR, POW_BASE_10[num_digits_after_decimal]);
    }

done:
    printf("\n");
}

输出:

gabriel$ cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c > fixed_point_math_copy.c && ./bin/fixed_point_math_c  
Begin.  
fraction bits = 16.  
whole number bits = 16.  
max whole number = 65535.  
  
price as a true double is 219.857142857.  
price as integer is 219.  
price fractional part is 56173 (of 65536).  
price fractional part as decimal is 0.857132 (56173/65536).  
price (manual float, 0 digits after decimal) is 219.  
price (manual float, 1 digit  after decimal) is 219.8.  
price (manual float, 2 digits after decimal) is 219.85.  
price (manual float, 3 digits after decimal) is 219.857.  
price (manual float, 4 digits after decimal) is 219.8571.  
price (manual float, 5 digits after decimal) is 219.85713. <== Fixed-point math decimal error first  
    starts to get introduced here since the fixed point resolution (1/65536) now has lower resolution  
    than the base-10 resolution (which is 1/100000) at this decimal place. Decimal error may not show  
    up at this decimal location, per say, but definitely will for all decimal places hereafter.  
price (manual float, 6 digits after decimal) is 219.857131.  
  
WITH MANUAL INTEGER-BASED ROUNDING:  
addend0 = 32768.  
addend1 = 3276.  
addend2 = 327.  
addend3 = 32.  
addend4 = 3.  
addend5 = 0.  
rounded price (manual float, rounded to 0 digits after decimal) is 220.  
rounded price (manual float, rounded to 1 digit  after decimal) is 219.9.  
rounded price (manual float, rounded to 2 digits after decimal) is 219.86.  
rounded price (manual float, rounded to 3 digits after decimal) is 219.857.  
rounded price (manual float, rounded to 4 digits after decimal) is 219.8571.  
rounded price (manual float, rounded to 5 digits after decimal) is 219.85713.  
  
RELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:  
  
EXAMPLE 1  
65401 * 16/127 = 8239. <== true answer  
1st approach (divide then multiply):  
  num16_result = 8224. <== Loses bits that right-shift out during the initial divide.  
2nd approach (split into 2 8-bit sub-numbers with bits at far right):  
  num16_result = 8207. <== Loses bits that right-shift out during the divide.  
3rd approach (split into 2 8-bit sub-numbers with bits centered):  
  num16_result = 8239. <== Perfect! Retains the bits that right-shift during the divide.  
  
EXAMPLE 2  
65401 * 99/127 = 50981. <== true answer  
1st approach (divide then multiply):  
  num16_result = 50886. <== Loses bits that right-shift out during the initial divide.  
2nd approach (split into 2 8-bit sub-numbers with bits at far right):  
  num16_result = 50782. <== Loses bits that right-shift out during the divide.  
3rd approach (split into 2 8-bit sub-numbers with bits centered):  
  num16_result = 1373. <== Completely wrong due to overflow during the multiply.  
4th approach (split into 4 4-bit sub-numbers with bits centered):  
  num16_result = 15870. <== Completely wrong due to overflow during the multiply.  
5th approach (split into 8 2-bit sub-numbers with bits centered):  
  num16_result = 50922. <== Loses a few bits that right-shift out during the divide.  
6th approach (split into 16 1-bit sub-numbers with bits skewed left):  
  num16_result = 50963. <== Loses the fewest possible bits that right-shift out during the divide.  
7th approach (split into 16 1-bit sub-numbers with bits skewed left):  
  num16_result = 50963. <== [same as 6th approach] Loses the fewest possible bits that right-shift out during the divide.  
[BEST APPROACH OF ALL] 8th approach (split into 16 1-bit sub-numbers with bits skewed left, w/integer rounding during division):  
  num16_result = 50967. <== Loses the fewest possible bits that right-shift out during the divide,   
  & has better accuracy due to rounding during the divide.  

参考资料:


0

如果你的唯一目的是为了节省内存,我不建议你这样做。价格计算中的误差可能会累积,最终导致错误。

如果你真的想要实现类似的功能,你可以只取价格的最小区间,然后直接使用 int 和整数操作来处理你的数字。只有在显示时才需要将其转换为浮点数,这样会让你的生活更轻松。


这更像是一个项目。所以我已经知道我将在小数点前后有5个数字。 - RagHaven

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接