Fixed-point arithmetic

In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (after the decimal point '.' in English decimal notation). Fixed-point number representation can be compared to the more complicated (and more computationally demanding) floating-point number representation.

Fixed-point numbers are useful for representing fractional values, usually in base 2 or base 10, when the executing processor has no floating point unit (FPU) as is the case for older or low-cost embedded microprocessors and microcontrollers, if fixed-point provides improved performance or accuracy for the application at hand, or if their use is more natural for the problem (such as for the representation of angles).

Representation

Value displayed with underlying integer scaled by 1/100:

  Displayed value   Integer handled internally
  0.00              0
  0.01              1
  0.02              2
  ...
  0.99              99
  1.00              100

A value of a fixed-point data type is essentially an integer that is scaled by an implicit specific factor determined by the type. For example, the value 1.23 can be represented as 1230 in a fixed-point data type with scaling factor of 1/1000, and the value 1,230,000 can be represented as 1230 with a scaling factor of 1000. Unlike floating-point data types, the scaling factor is the same for all values of the same type, and does not change during the entire computation.
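As a minimal sketch of this scaled-integer view (the helper names are illustrative, not from any particular library):

```python
SCALE = 100  # implicit scaling factor 1/100: two decimal digits

def to_fixed(x: float) -> int:
    """Encode a real value as the underlying scaled integer."""
    return round(x * SCALE)

def to_real(i: int) -> float:
    """Recover the represented value from the underlying integer."""
    return i / SCALE

# 1.23 is stored as the integer 123; every value of the type shares SCALE.
print(to_fixed(1.23), to_real(123))
```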

The scaling factor is usually a power of 10 (for human convenience) or a power of 2 (for computational efficiency). However, other scaling factors may be used occasionally, e.g. a time value in hours may be represented as a fixed-point type with a scale factor of 1/3600 to obtain values with one-second accuracy.

The maximum value of a fixed-point type is simply the largest value that can be represented in the underlying integer type multiplied by the scaling factor; and similarly for the minimum value.

Operations

To convert a number from a fixed point type with scaling factor R to another type with scaling factor S, the underlying integer must be multiplied by R and divided by S; that is, multiplied by the ratio R/S. Thus, for example, to convert the value 1.23 = 123/100 from a type with scaling factor R=1/100 to one with scaling factor S=1/1000, the underlying integer 123 must be multiplied by (1/100)/(1/1000) = 10, yielding the representation 1230/1000. If S does not divide R (in particular, if the new scaling factor S is greater than the original R), the new integer will have to be rounded. The rounding rules and methods are usually part of the language's specification.
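The conversion can be sketched with exact rational arithmetic (the function name and the round-to-nearest choice are illustrative; real languages pin down the rounding rule in their specifications):

```python
from fractions import Fraction

def rescale(i: int, r: Fraction, s: Fraction) -> int:
    """Convert the underlying integer i from scaling factor r to s
    by multiplying by the ratio r/s, rounding to the nearest integer
    (Python's round, i.e. half to even)."""
    return round(Fraction(i) * r / s)

# 1.23 at scale 1/100 becomes 1230 at scale 1/1000 -- exact, no rounding.
print(rescale(123, Fraction(1, 100), Fraction(1, 1000)))
```

When s does not divide r, as in a conversion from 1/100 to 1/10, the same helper rounds: 12.3 units become 12.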

To add or subtract two values of the same fixed-point type, it is sufficient to add or subtract the underlying integers, and keep their common scaling factor. The result can be exactly represented in the same type, as long as no overflow occurs (i.e. provided that the sum of the two integers fits in the underlying integer type). If the numbers have different fixed-point types, with different scaling factors, then one of them must be converted to the other before the sum.

To multiply two fixed-point numbers, it suffices to multiply the two underlying integers, and assume that the scaling factor of the result is the product of their scaling factors. This operation involves no rounding. For example, multiplying the numbers 123 scaled by 1/1000 (0.123) and 25 scaled by 1/10 (2.5) yields the integer 123×25 = 3075 scaled by (1/1000)×(1/10) = 1/10000, that is 3075/10000 = 0.3075. If the two operands belong to the same fixed-point type, and the result is also to be represented in that type, then the product of the two integers must be explicitly multiplied by the common scaling factor; in this case the result may have to be rounded, and overflow may occur. For example, if the common scaling factor is 1/100, multiplying 1.23 by 0.25 entails multiplying 123 by 25 to yield 3075 with an intermediate scaling factor of 1/10000. This then must be multiplied by 1/100 to yield either 31 (0.31) or 30 (0.30), depending on the rounding method used, to result in a final scale factor of 1/100.
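For instance, in a hypothetical type where both operands and the result share scale 1/100, the explicit rescaling looks like this (the helper name is illustrative; rounding here is half-up, valid for non-negative operands):

```python
SCALE = 100  # common scaling factor 1/100

def fx_mul(a: int, b: int) -> int:
    """Multiply two underlying integers at scale 1/SCALE; the
    intermediate product is at scale 1/SCALE**2 and must be divided
    back down, rounding half up (for non-negative operands)."""
    wide = a * b                         # 123 * 25 = 3075, scale 1/10000
    return (wide + SCALE // 2) // SCALE  # back to scale 1/100

print(fx_mul(123, 25))  # 1.23 * 0.25 = 0.3075, rounded to 0.31 -> 31
```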

To divide two fixed-point numbers, one takes the integer quotient of their underlying integers, and assumes that the scaling factor is the quotient of their scaling factors. The first division involves rounding in general. For example, division of 3456 scaled by 1/100 (34.56) and 1234 scaled by 1/1000 (1.234) yields the integer 3456÷1234 = 3 (rounded) with scale factor (1/100)/(1/1000) = 10, that is, 30. One can obtain a more accurate result by first converting the dividend to a more precise type: in the same example, converting 3456 scaled by 1/100 (34.56) to 3,456,000 scaled by 1/100000, before dividing by 1234 scaled by 1/1000 (1.234), would yield 3456000÷1234 = 2801 (rounded) with scaling factor (1/100000)/(1/1000) = 1/100, that is 28.01 (instead of 30). If both operands and the desired result all have the same scaling factor, then the quotient of the two integers must be explicitly multiplied by that common scaling factor.
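The same division can be sketched with exact rationals (names and the round-to-nearest choice are illustrative):

```python
from fractions import Fraction

def fx_div(a: int, ra: Fraction, b: int, rb: Fraction, rs: Fraction) -> int:
    """Divide (a*ra) by (b*rb) and express the quotient at scale rs,
    rounding to the nearest integer."""
    return round((Fraction(a) * ra) / (Fraction(b) * rb) / rs)

# 34.56 / 1.234 = 28.0064...; widening the dividend first yields 28.01
print(fx_div(3456, Fraction(1, 100), 1234, Fraction(1, 1000), Fraction(1, 100)))
```

Asking for the coarse result scale of 10, as in the first variant above, gives the integer 3 (representing 30) instead.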

Binary vs. decimal

The two most common classes of fixed-point types are decimal and binary. Decimal fixed-point types have a scaling factor that is a power of ten; for binary fixed-point types it is a power of two.

Binary fixed-point types are most commonly used, because the rescaling operations can be implemented as fast bit shifts. Binary fixed-point numbers can represent fractional powers of two exactly, but, like binary floating-point numbers, cannot exactly represent fractional powers of ten. If exact fractional powers of ten are desired, then a decimal format should be used. For example, one-tenth (0.1) and one-hundredth (0.01) can be represented only approximately by binary fixed-point or binary floating-point representations, while they can be represented exactly in decimal fixed-point or decimal floating-point representations. These representations may be encoded in many ways, including binary-coded decimal (BCD).
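A quick check of this claim (the 16-bit fraction width is an arbitrary choice for illustration):

```python
from fractions import Fraction

f = 16                                 # fractional bits (illustrative)
best = round(Fraction(1, 10) * 2**f)   # nearest binary fixed-point to 0.1
assert Fraction(best, 2**f) != Fraction(1, 10)  # never exact in binary
print(float(Fraction(best, 2**f)))     # 0.100006103515625
# Decimal fixed point with scale 1/10 stores one-tenth exactly, as the integer 1.
```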

A worked example

Suppose we want to multiply two fixed-point numbers, each with three decimal places.

(10.500)(1.050) = 1×10.500 + 0.050×10.500 = 10.500 + 0.525000 = 11.025000

Note that since there are three decimal places we show the trailing zeros. To recast this as an integer multiplication we first multiply each operand by 1000 (10^3), moving all the decimal places into integer places, then multiply by 10^(-3) once per operand to restore the scale. The equation now looks like:

(10.500 × 10^3)(1.050 × 10^3)(10^(-3))(10^(-3))
= (10500)(1050)(10^(-6))
= 11 025 000 × 10^(-6)
= 11.025000

This works equivalently in other bases, notably base 2 for computing, since a bit shift is the same as a multiplication or division by a power of 2. Three decimal digits are equivalent to about 10 binary digits (10^3 ≈ 2^10), so we should round 0.05 to 10 bits after the binary point. The closest approximation is then 0.0000110011.

10 = 8 + 2 = 2^3 + 2^1
1 = 2^0
0.5 = 2^(-1)
0.05 ≈ 0.0000110011_2

Thus our multiplication becomes

(1010.100 × 2^3)(1.0000110011 × 2^10)(2^(-13))
= (1010100)(10000110011)(2^(-13))
= (10110000010111100)(2^(-13))
= 1011.0000010111100

This rounds to 11.023 with three digits after the decimal point.
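The whole worked example can be checked with integer arithmetic only (a sketch following the scales used above):

```python
# Decimal: both operands scaled by 10**3
prod = 10500 * 1050          # scale 10**-6
assert prod == 11_025_000    # i.e. 11.025000

# Binary: 10.5 shifted by 2**3, 1.05 rounded to 10 fractional bits
x = 0b1010100                # 10.5  * 2**3
y = 0b10000110011            # ~1.05 * 2**10
p = x * y                    # scale 2**-13
assert p == 0b10110000010111100
print(round(p / 2**13, 3))   # ~11.023
```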

Notation

There are various notations used to represent word length and radix point in a binary fixed-point number. In the following list, f represents the number of fractional bits, m the number of magnitude or integer bits, s the number of sign bits, and b the total number of bits.

  • Qf: The "Q" prefix. For example, Q15 represents a number with 15 fractional bits. This notation is ambiguous since it does not specify the word length, however it is usually assumed that the word length is either 16 or 32 bits depending on the target processor in use.[1]
  • Qm.f: The unambiguous form of the "Q" notation. Since the entire word is a 2's complement integer, a sign bit is implied. For example, Q1.30 describes a number with 1 integer bit and 30 fractional bits stored as a 32-bit 2's complement integer.[1][2]
  • fxm.b: The "fx" prefix is similar to the above, but uses the word length as the second item in the dotted pair. For example, fx1.16 describes a number with 1 magnitude bit and 15 fractional bits in a 16 bit word.[3]
  • s:m:f: Yet other notations include a sign bit, such as this one used in the PS2 GS User's Guide.[4] It also differs from conventional usage by using a colon instead of a period as the separator. For example, in this notation, 0:8:0 represents an unsigned 8-bit integer.
  • (p,q): Used in the PL/I programming language to specify p total digits (not including sign) with q after the radix point; q can be positive or negative, and the radix binary or decimal.
  • <s,p,i>: As used with the LabVIEW programming language for 'FXP' fixed-point numbers, where s is either + or ±, signifying an unsigned or 2's complement signed number respectively. The format indicates p total digits with i as the 'integer part'. Notably, this format allows arbitrary placement of the 'units' place, which need not occur within the given digits: while the total number of bits p must be a natural number, i may be larger than p, zero, or even negative; these cases merely correspond to different overall scaling factors for the number.
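A minimal sketch of Qm.f encoding and decoding (the helper names are illustrative; m counts the sign bit per the Qm.f convention above):

```python
def to_q(x: float, f: int) -> int:
    """Encode x with f fractional bits (round to nearest)."""
    return round(x * (1 << f))

def from_q(i: int, f: int) -> float:
    """Decode a Q-format integer with f fractional bits."""
    return i / (1 << f)

# Q1.30: one sign/integer bit plus 30 fractional bits in a 32-bit word
half = to_q(0.5, 30)
print(hex(half), from_q(half, 30))  # 0x20000000 0.5
```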

Binary scaling

Binary scaling is a computer programming technique used typically in embedded C, DSP and assembler programs to implement non-integer operations by using the native integer arithmetic of the processor.

Overview

A representation of a value using binary scaling is more precise than a floating-point representation occupying the same number of bits, but typically represents values of a more limited range, therefore more easily leading to arithmetic overflow during computation. Implementation of operations using integer arithmetic instructions is often (but not always) faster than the corresponding floating-point instructions.

A position for the 'binary point' is chosen for each variable to be represented, and binary shifts associated with arithmetic operations are adjusted accordingly. The binary scaling corresponds to the first number in Q notation: Q1.15 is a 16-bit integer with one integer (sign) bit and fifteen fractional bits. A Bscal 1 or Q1.15 number would represent approximately −1.0 to +0.99997.

To give an example, a common way to use integer arithmetic to simulate floating point, using 32-bit numbers, is to multiply the coefficients by 65536.

Using binary scientific notation, this places the binary point at B16. That is to say, the most significant 16 bits represent the integer part and the remaining bits represent the fractional part. As a signed two's complement number, a B16 value in a 32-bit word can hold a highest value of just under 32768.0 (precisely 32767 + 65535/65536) and a lowest value of −32768.0. Put another way, the B number is the number of integer bits used to represent the number, which defines its value range. The remaining low bits (i.e. the non-integer bits) are used to store fractional quantities and supply more accuracy.

For instance, to represent 1.2 and 5.6 as B16 one multiplies them by 2^16 (65536), giving 78643 and 367001 as B16 representations (the fractional parts are truncated).

Multiplying these together gives

28862059643

To convert it back to B16, divide it by 2^16 and round.

This gives 440400 in B16, which when converted back to a decimal number (by dividing again by 2^16) gives approximately 6.71997. The correct result is 6.72.
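Reproducing the figures above with plain Python integers (truncation via int() matches the representations quoted in the text):

```python
B = 16
a = int(1.2 * (1 << B))         # 78643
b = int(5.6 * (1 << B))         # 367001
assert (a, b) == (78643, 367001)

prod = a * b                    # scale 2**-32
assert prod == 28862059643

back = round(prod / (1 << B))   # rescale to B16
assert back == 440400
print(back / (1 << B))          # ~6.71997
```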

Re-scaling after multiplication

The example above for a B16 multiplication is a simplified example. Re-scaling depends on both the B scale value and the word size. B16 is often used in 32 bit systems because it works simply by multiplying and dividing by 65536 (or shifting 16 bits).

Consider the binary point in a signed 32-bit word thus:

 S X X X X X X X   X X X X X X X X   X X X X X X X X   X X X X X X X X

where S is the sign bit and the X are the other 31 bits.

Placing the binary point at

  • 0 gives a range of −1.0 to 0.999999.
  • 1 gives a range of −2.0 to 1.999999
  • 2 gives a range of −4.0 to 3.999999 and so on.

When using different B scalings and/or word sizes the complete B scaling conversion formula must be used.

Consider a 32 bit word size, and two variables, one with a B scaling of 2 and the other with a scaling of 4.

1.4 @ B2 is 1.4 * (2 ^ (wordsize-2-1)) == 1.4 * 2 ^ 29 == 0x2CCCCCCD

Note that here the value 1.4 is represented with 29 fraction bits, whereas a 32-bit floating-point number has only 23 bits to store the fraction. This is why B scaling can be more accurate than floating point of the same word size when the range of values is known in advance and sufficiently limited. This is especially useful in integrators, or in repeated summing of small quantities, where rounding error can be a subtle but very dangerous problem when using floating point.

Now a larger number 15.2 at B4.

15.2 @ B4 is 15.2 * (2 ^ (wordsize-4-1)) == 15.2 * 2 ^ 27 == 0x7999999A

The number of bits used to store the fraction is 27. Multiplying these 32-bit numbers gives the 64-bit result 0x1547AE14A51EB852

This result is in B7 in a 64 bit word. Shifting it down by 32 bits gives the result in B7 in 32 bits.

0x1547AE14

To convert back to a real number, divide this by 2^(wordsize−7−1) = 2^24, resulting in approximately 21.28.

Various scalings may be used. B0 for instance can be used to represent any number between −1 and 0.999999999.
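The mixed-scale example above can be sketched as follows (32-bit word with one sign bit, so Bn leaves 31−n fraction bits; the helper name is illustrative):

```python
WORD = 32

def encode(x: float, b: int) -> int:
    """Encode x at B-scale b: multiply by 2**(WORD - b - 1), round."""
    return round(x * 2 ** (WORD - b - 1))

a = encode(1.4, 2)              # B2, 29 fraction bits
c = encode(15.2, 4)             # B4, 27 fraction bits
assert (a, c) == (0x2CCCCCCD, 0x7999999A)

prod = a * c                    # 64-bit result at scale 2**-56, i.e. B7
assert prod == 0x1547AE14A51EB852
assert prod >> 32 == 0x1547AE14 # same B7 value in 32 bits
print((prod >> 32) / 2**24)     # ~21.28
```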

Binary angles

Binary scaling (B0) representation of angles. Black is traditional degrees representation, green is BAM as a decimal number and red is hexadecimal 32 bit representation of the BAM.

Binary angles are mapped using B0, with 0 as 0 degrees, 0.5 as 90° (or π/2 rad), −1.0 or 0.9999999 as 180° (or π rad), and −0.5 as 270° (or −π/2 rad). When these binary angles are added using normal two's complement mathematics, the rotation of the angles is correct, even when crossing the sign boundary; this conveniently does away with checks for angles ≥ 360° when handling ordinary angles[5] (data that allow angles with more than one rotation must use some other encoding).

The terms binary angular measurement (BAM)[6] and binary angular measurement system (BAMS)[7] as well as brads (binary radians or binary degree) refer to implementations of binary angles. They find use in robotics, navigation,[8] computer games,[9] and digital sensors.[10]

No matter what bit-pattern is stored in a binary angle, when it is multiplied by 180° (or π rad) using standard signed fixed-point arithmetic, the result is always a valid angle in the range of −180 degrees (−π radians) to +180 degrees (+π radians). In some cases, it is convenient to use unsigned multiplication (rather than signed multiplication) on a binary angle, which gives the correct angle in the range of 0 to +360 degrees (+2π radians or +1 turn). Compared to storing angles in a binary angle format, storing angles in any other format inevitably results in some bit patterns giving "angles" outside that range, requiring extra steps to range-reduce the value to the desired range, or results in some bit patterns that are not valid angles at all (NaN), or both.
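A sketch of the wraparound property with 16-bit binary angles (the word size and helper name are illustrative):

```python
MASK = 0xFFFF  # 16-bit word: 2**16 angle units per turn

def to_degrees(a: int) -> float:
    """Interpret the bit pattern as two's complement and scale by 180/2**15."""
    signed = a - 0x10000 if a & 0x8000 else a
    return signed * 180 / 2**15

a90, a135 = 0x4000, 0x6000    # 90 and 135 degrees
s = (a90 + a135) & MASK       # 225 degrees wraps past +180...
print(to_degrees(s))          # ...to -135 degrees, with no range check needed
```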

Application of binary scaling techniques

Binary scaling techniques were used in the 1970s and 1980s for mathematically intensive real-time computing such as flight simulation, and in nuclear power plant control algorithms since the late 1960s. The code was often commented with the binary scalings of the intermediate results of equations.

Binary scaling is still used in many DSP applications, and custom-made microprocessors are usually based on binary scaling techniques. Binary angular measurement is used in the CORDIC co-processor built into the STM32G4 series.

Binary scaling is currently used in the discrete cosine transform (DCT) used to compress JPEG images in utilities such as GIMP.

Although floating point has taken over to a large degree, where speed and extra accuracy are required, binary scaling works on simpler hardware and is more accurate when the range of values is known in advance and is sufficiently limited. This is because all the bits in a binary scaled integer are used for the precision of the value (though there may be leading zeros if the range of values is large), whereas in floating point, some bits are used to define the scaling.

Precision loss and overflow

Because fixed-point operations can produce results with more digits than the operands, information loss is possible. For instance, the result of a fixed-point multiplication could have as many digits as the sum of the numbers of digits in the two operands. In order to fit the result into the same number of digits as the operands, the answer must be rounded or truncated; in that case, the choice of which digits to keep is very important. When multiplying two fixed-point numbers with the same format, for instance with I integer digits and Q fractional digits, the answer could have up to 2I integer digits and 2Q fractional digits.

For simplicity, many fixed-point multiply procedures use the same result format as the operands. This has the effect of keeping the middle digits; the I-number of least significant integer digits, and the Q-number of most significant fractional digits. Fractional digits lost below this value represent a precision loss which is common in fractional multiplication. If any integer digits are lost, however, the value will be radically inaccurate. Some model-based fixed-point packages[11] allow specifying a result format different from the input formats, enabling the user to maximize precision and avoid overflow.
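In binary Q formats, keeping the middle digits after a same-format multiply is a single shift (a sketch; the 8-bit fraction width is arbitrary):

```python
F = 8  # fractional bits of the (illustrative) shared format

def qmul(a: int, b: int) -> int:
    """Same-format multiply: the raw product has 2*F fraction bits;
    shifting right by F keeps the middle digits, truncating the F
    least significant fraction bits."""
    return (a * b) >> F

x = int(1.5 * 2**F)       # 0x180
y = int(2.25 * 2**F)      # 0x240
print(qmul(x, y) / 2**F)  # 3.375, exact here since the truncated bits are zero
```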

Some operations, like divide, often have built-in result limiting so that any positive overflow results in the largest possible number that can be represented by the current format. Likewise, negative overflow results in the largest negative number represented by the current format. This built in limiting is often referred to as saturation.
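Saturation can be sketched as a clamp to the word's representable range (here a signed 16-bit word; the names are illustrative):

```python
INT16_MAX, INT16_MIN = 32767, -32768

def saturate(x: int) -> int:
    """Clamp an overflowing result to the nearest representable value."""
    return max(INT16_MIN, min(INT16_MAX, x))

print(saturate(40000), saturate(-50000), saturate(1234))
```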

Some processors support a hardware overflow flag that can generate an exception on the occurrence of an overflow, but it is usually too late to salvage the proper result at this point.

Computer language implementations

Very few computer languages include built-in support for fixed point values other than with the radix point immediately to the right of the least significant digit (i.e. integer), because for most applications, binary or decimal floating-point representations are usually simpler to use and accurate enough. Floating-point representations are easier to use than fixed-point representations, because they can handle a wider dynamic range and do not require programmers to specify the number of digits after the radix point. However, if they are needed, fixed-point numbers can be implemented even in programming languages like C and C++, which do not commonly include such support.

A common use of fixed-point BCD numbers is for storing monetary values, where the inexact values of binary floating-point numbers are often a liability. Historically, fixed-point representations were the norm for decimal data types; for example, in PL/I or COBOL. The Ada programming language includes built-in support for both fixed-point (binary and decimal) and floating-point. JOVIAL and Coral 66 also provide both floating- and fixed-point types.

ISO/IEC TR 18037[12] specifies fixed-point data types for the C programming language; vendors are expected to implement the language extensions for fixed point arithmetic in coming years. Fixed-point support is implemented in GCC.[13][14]

Fixed-point arithmetic should not be confused with the decimal floating point found in programming languages like C# and Python.

Almost all relational databases, and the SQL standard, support fixed-point decimal arithmetic and storage of numbers. PostgreSQL has a special numeric type for exact storage of numbers with up to 1000 digits.[15]

Software application examples

  • The Nest Labs Utilities library provides a limited set of macros and functions for fixed-point numbers, particularly when dealing with those numbers in the context of sensor sampling and sensor outputs.
  • GnuCash is an application for tracking money which is written in C. It switched from a floating-point representation of money to a fixed-point implementation as of version 1.6. This change was made to trade the less predictable rounding errors of floating-point representations for more control over rounding (for example, to the nearest cent).
  • Tremor, Toast and MAD are software libraries which decode the Ogg Vorbis, GSM Full Rate and MP3 audio formats respectively. These codecs use fixed-point arithmetic because many audio decoding hardware devices do not have an FPU (partly to save money, but primarily to save power - integer units are much smaller in silicon area than an FPU) and audio decoding requires performance to the extent a software implementation of floating-point on low-speed devices would not produce output in real time.
  • All 3D graphics engines on Sony's original PlayStation, Sega's Saturn, Nintendo's Game Boy Advance (only 2D), Nintendo DS (2D and 3D), Nintendo GameCube[16] and GP2X Wiz video game systems use fixed-point arithmetic for the same reason as Tremor and Toast: to gain throughput on an architecture without an FPU (e.g. the PlayStation included hardware support for 4.12-bit fixed-point values in its transformation coprocessor).
  • The OpenGL ES 1.x specification includes a fixed point profile, as it is an API aimed for embedded systems, which do not always have an FPU.
  • TeX uses fixed point with 16 bits after the binary point, in units of points, for all position calculations. TeX font metric files use 32-bit signed fixed-point numbers, with 12 bits to the left of the binary point.
  • The dc and bc programs are arbitrary precision calculators, but only keep track of a (user-specified) fixed number of fractional digits.
  • VisSim A visually programmed block diagram language supporting a fixed-point block set to allow simulation and automatic code generation of fixed-point operations. Both word size and radix point can be specified on an operator basis.
  • Fractint represents numbers as Q2.29 fixed-point numbers,[17] to speed up drawing on old PCs with 386 or 486SX processors, which lacked an FPU.
  • Doom was the last first-person shooter title by id Software to use a 16.16 fixed point representation for all of its non-integer computations, including map system, geometry, rendering, player movement etc. This was done in order for the game to be playable on 386 and 486SX CPUs without an FPU. For compatibility reasons, this representation is still used in modern Doom source ports.
  • Wavpack is a fixed-point, lossless audio compressor. Its creator, David Bryant, justifies the decision to implement fixed instead of floating point: "I believe that integer operations are less susceptible to subtle chip to chip variations that could corrupt the lossless nature of the compression."[18]
  • Fixed-point numbers are sometimes used for storing and manipulating images and video frames. Processors with SIMD units aimed at image processing may include instructions suitable for handling packed fixed-point data.
  • The Q# programming language for the Azure quantum computers, that implement quantum logic gates, contains a standard numeric library for performing fixed-point arithmetic on registers of qubits.[19]
  • The TrueType font format uses the F26Dot6 format (32-bit signed fixed-point with 26 bits to the left of the binary point) for some numeric values in its instructions.[20] This format was chosen to provide the minimal amount of precision required for hinting and for performance reasons.[21]
  • The FixedPointNumbers package for Julia implements both 2^n-scaled and (2^n−1)-scaled fixed-point numbers. The latter are used as a foundation for image processing to provide a consistent scale for both "integer" and floating-point intensity values, thus avoiding the implicit "255 == 1.0" (non-equation) present in many other image processing frameworks.

References

  1. ^ a b Texas Instruments, TMS320C64x DSP Library Programmer's Reference, Appendix A.2
  2. ^ "MathWorks Fixed-Point Toolbox Documentation Glossary". mathworks.com.
  3. ^ solidThinking, Inc. "VisSim is now solidThinking Embed". www.vissim.com.
  4. ^ PS2 GS User's Guide, Chapter 7.1 "Explanatory Notes"
  5. ^ Hargreaves, Shawn. "Angles, integers, and modulo arithmetic". blogs.msdn.com. Archived from the original on 2019-06-30. Retrieved 2019-08-05.
  6. ^ "Binary angular measurement". Archived from the original on 2009-12-21.
  7. ^ "Binary Angular Measurement System". acronyms.thefreedictionary.
  8. ^ LaPlante, Phillip A. (2004). "Chapter 7.5.3, Binary Angular Measure". Real-Time Systems Design and Analysis. www.globalspec.com.
  9. ^ Sanglard, Fabien (2010-01-13). "Doom 1993 code review - Section "Walls"". fabiensanglard.net.
  10. ^ "Hitachi HM55B Compass Module (#29123)" (PDF). www.hobbyengineering.com. Parallax Digital Compass Sensor (#29123). Parallax, Inc. May 2005. Archived from the original (PDF) on 2011-07-11 – via www.parallax.com.
  11. ^ VisSim Fixed-Point User Guide, http://www.vissim.com/downloads/doc/EmbeddedControlsDeveloper_UGv80.pdf
  12. ^ JTC1/SC22/WG14, status of TR 18037: Embedded C
  13. ^ GCC wiki, Fixed-Point Arithmetic Support
  14. ^ Using GCC, section 5.13 Fixed-Point Types
  15. ^ PostgreSQL manual, section 8.1.2. Arbitrary Precision Numbers
  16. ^ "Dolphin Emulator". Dolphin Emulator.
  17. ^ Fractint, A Little Code
  18. ^ "WavPack Technical Description". www.wavpack.com. Retrieved 2015-07-13.
  19. ^ "Introduction to the Quantum Numerics Library". Retrieved 2019-11-13.
  20. ^ "The TrueType Instruction Set: Data types".
  21. ^ "[Freetype] Why 26.6 ?".
