Floating Point Format Cheat Sheet
=================================

S:E:I:M   S + E + I + M = N
-----------------------------------

 - S.ign bit stored: 1: present (almost all formats), 0: absent (some hardware has unsigned floats)

 - E.xponent length: Usually >0 unless you want a sign-magnitude integer with -0 support

 - I.mplicit mantissa MSB: 0: implicit (almost all formats), 1: explicit (8087 internal format)

 - M.antissa bits: Usually >0 unless all you want is powers of 2

 - N: total number of stored bits (byte order in memory depends on endianness, LE or BE)
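
A minimal sketch of reading the S:E:I:M notation and checking the total width N. The helper name `total_bits` is made up for illustration.

```python
def total_bits(spec: str) -> int:
    """Return N = S + E + I + M for a spec string such as "1:8:0:23"."""
    s, e, i, m = (int(field) for field in spec.split(":"))
    return s + e + i + m


# A few formats from this sheet, written in S:E:I:M notation.
for name, spec in [
    ("binary32", "1:8:0:23"),
    ("bfloat16", "1:8:0:7"),
    ("8087 extended", "1:15:1:63"),
]:
    print(f"{name:14s} {spec:10s} N = {total_bits(spec)}")
```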


--- IEEE 754 formats

binary16 = 1:5:0:10 ("f16", "float16", "half")

binary32 = 1:8:0:23 ("f32", "float32", "float" (nowadays))

binary64 = 1:11:0:52 ("f64", "float64", "double" (nowadays))

binary128 = 1:15:0:112 ("f128", "float128", "quad"; "long double" on some platforms; not the same as "double double", which is a pair of binary64)


--- Application-specific formats

bfloat16 = 1:8:0:7

Pixar PXR24 = 1:8:0:15

AMD fp24 = 1:7:0:16

Intel 8087 extended precision (80 bit) = 1:15:1:63


--- Handy rules

E = length of exponent in bits
M = length of mantissa in bits
I+M = number of stored mantissa bits; the actual precision is M + 1 bits, whether the leading bit is implicit or explicit

s = sign actual bit
e = exponent actual bits (unsigned integer)
m = mantissa actual bits (unsigned integer)
bias = 2^(E - 1) - 1
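
As a quick check of the bias rule, the sketch below computes the bias and the unbiased exponent range of normal numbers for a given E. It assumes, as in the Classifications section, that stored exponents 0 and ~0 are reserved, so normals use e = 1 .. 2^E - 2. The function names are illustrative only.

```python
def bias(E: int) -> int:
    """Exponent bias for an E-bit exponent field."""
    return 2 ** (E - 1) - 1


def normal_exponent_range(E: int) -> tuple[int, int]:
    """Smallest and largest unbiased exponent of a normal number.

    Stored exponents 0 and all-ones are reserved (zero/denormal, inf/NaN),
    so normals use stored values 1 .. 2^E - 2.
    """
    b = bias(E)
    return 1 - b, (2 ** E - 2) - b


for E in (5, 8, 11, 15):
    print(E, bias(E), normal_exponent_range(E))
```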


--- Decoding to value

Normal (implicit mantissa leading bit = 1):
n = (-1.0)^s * (1.0 + (m / 2.0^M)) * 2.0^(e - bias)

Denormal (implicit mantissa leading bit = 0):
d = (-1.0)^s * (m / 2.0^M) * 2.0^(1 - bias)

Numbers (explicit leading bit, stored as the top bit of m, so m has I + M bits; denormals are stored directly):
x = (-1.0)^s * (m / 2.0^M) * 2.0^(e - bias)      (use 2.0^(1 - bias) when e = 0)
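
A minimal decoding sketch for implicit-leading-bit formats (S = 1, I = 0). It scales the mantissa by 2^M and uses the fixed exponent 1 - bias for denormals. The function name `decode` is illustrative, and because the result is a Python float the sketch is only meaningful for formats no wider than binary64.

```python
import math
import struct


def decode(bits: int, E: int, M: int) -> float:
    """Decode a 1:E:0:M bit pattern (given as an unsigned integer) to a float."""
    s = (bits >> (E + M)) & 1          # sign bit
    e = (bits >> M) & ((1 << E) - 1)   # stored (biased) exponent
    m = bits & ((1 << M) - 1)          # stored mantissa bits
    bias = 2 ** (E - 1) - 1
    sign = -1.0 if s else 1.0

    if e == (1 << E) - 1:              # exponent all ones: inf or NaN
        return sign * math.inf if m == 0 else math.nan
    if e == 0:                         # zero or denormal: leading bit is 0
        return sign * (m / 2.0 ** M) * 2.0 ** (1 - bias)
    return sign * (1.0 + m / 2.0 ** M) * 2.0 ** (e - bias)


# Cross-check one binary32 pattern against the struct module's decoding.
pattern = 0x40490FDB
assert decode(pattern, 8, 23) == struct.unpack(">f", pattern.to_bytes(4, "big"))[0]
```

For example, `decode(0x3C00, 5, 10)` decodes the binary16 pattern for 1.0.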


--- Encoding to format

Turn an arbitrarily long binary fixed-point string (integer bits, binary point, fraction bits) into a floating point number:


0. Find the leading nonzero bit, and discard it (for implicit formats), or store it (for explicit)
1. Fill the mantissa, m, with the next M bits (or pad with zeros if not present)
2. Save the exponent, e, as the count of bits from the binary point to the leading nonzero bit (0 for the ones place, negative for fraction bits), *and* then add the bias to it

If e >= ~0, return +/-Infinity
If e <= 0 && e > -M, return a denormal number with e = 0 and m = the bits that still fit after shifting right by 1 - e (leading bit included)
If e <= -M, return +/-0.0
Otherwise, it's a normal number

                               175073.3333334513784317
0000000000000000000101010101111100001.010101010101010101010111010100000101010101001010101001

binary64
e = 17 + 1023 = 10000010000
m = 0101010111110000101010101010101010101011101010000010
0 10000010000 0101010111110000101010101010101010101011101010000010
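
The steps above can be sketched as follows, using exact rational arithmetic. This is an illustration only: `encode` is a made-up name, it handles 1:E:0:M formats, and it truncates extra bits the way the worked example does rather than rounding to nearest (which would give a different last bit for this input).

```python
from fractions import Fraction


def encode(value, E: int, M: int) -> int:
    """Encode a finite real number as a 1:E:0:M bit pattern, truncating extra bits."""
    x = Fraction(value)
    s = 1 if x < 0 else 0
    x = abs(x)
    bias = 2 ** (E - 1) - 1
    all_ones = (1 << E) - 1
    if x == 0:
        return s << (E + M)

    # Step 0: locate the leading nonzero bit, i.e. exp with 2^exp <= x < 2^(exp + 1).
    exp = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** exp > x:
        exp -= 1

    e = exp + bias                                   # step 2: biased exponent
    if e >= all_ones:                                # too large: infinity
        return (s << (E + M)) | (all_ones << M)
    if e <= 0:                                       # denormal, or zero if nothing fits
        m = int(x * Fraction(2) ** (M + bias - 1))   # scale is fixed at 2^(1 - bias)
        return (s << (E + M)) | m
    m = int((x / Fraction(2) ** exp - 1) * 2 ** M)   # step 1: next M bits, truncated
    return (s << (E + M)) | (e << M) | m


# Reproduce the binary64 example from the bit string given above.
text = "0000000000000000000101010101111100001.010101010101010101010111010100000101010101001010101001"
int_part, frac_part = text.split(".")
x = Fraction(int(int_part + frac_part, 2), 2 ** len(frac_part))
expected = "0 10000010000 0101010111110000101010101010101010101011101010000010"
assert format(encode(x, 11, 52), "064b") == expected.replace(" ", "")
```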


--- Classifications


+/-0.0               e = 0 && m = 0         ? 000...0 000...0
+0.0        s = 0 && e = 0 && m = 0         0 000...0 000...0
-0.0        s = 1 && e = 0 && m = 0         1 000...0 000...0

+/-normal            e != 0 && e != ~0      ?   e'    ???...?  0 < e' < ~0
+normal     s = 0 && e != 0 && e != ~0      0   e'    ???...?
-normal     s = 1 && e != 0 && e != ~0      1   e'    ???...?

+/-denormal          e = 0 && m != 0        ? 000...0   m'     m' != 0
+denormal   s = 0 && e = 0 && m != 0        0 000...0   m'
-denormal   s = 1 && e = 0 && m != 0        1 000...0   m'

+/-inf               e = ~0 && m = 0        ? 111...1 000...0
+inf        s = 0 && e = ~0 && m = 0        0 111...1 000...0
-inf        s = 1 && e = ~0 && m = 0        1 111...1 000...0

+/-NaN      e = ~0 && m != 0                ? 111...1   r       r != 0
+/-qNaN     e = ~0           && MSB(m) = 1  ? 111...1 1??...?
+/-sNaN     e = ~0 && m != 0 && MSB(m) = 0  ? 111...1 0pp...p   p != 0

Notes:
------
normal   = regular numbers
denormal = (a.k.a. subnormal) a value too small for a normal: precision is lost, but a residual non-zero value remains
sNaN     = signaling NaN: raises an invalid-operation exception when used; whether that becomes a program error depends on the environment
qNaN     = quiet NaN:     propagates silently through most operations
&&       = logical (not bitwise) "and"
?        = can be 0 or 1
~0       = all ones
MSB()    = most significant bit
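
A small classifier sketch for the table above, for formats with a sign bit and an implicit leading bit (1:E:0:M). It assumes the IEEE 754-2008 convention that a set mantissa MSB marks a quiet NaN; some older hardware used the opposite convention. The name `classify` and the returned labels are illustrative.

```python
def classify(bits: int, E: int, M: int) -> str:
    """Classify a 1:E:0:M bit pattern, e.g. "+normal", "-denormal", "+qNaN"."""
    s = (bits >> (E + M)) & 1
    e = (bits >> M) & ((1 << E) - 1)
    m = bits & ((1 << M) - 1)
    all_ones = (1 << E) - 1            # the "~0" exponent

    if e == 0:
        kind = "0.0" if m == 0 else "denormal"
    elif e == all_ones:
        if m == 0:
            kind = "inf"
        elif m >> (M - 1):             # MSB(m) = 1: quiet (IEEE 754-2008 convention)
            kind = "qNaN"
        else:                          # MSB(m) = 0 and m != 0: signaling
            kind = "sNaN"
    else:
        kind = "normal"
    return ("-" if s else "+") + kind


for pattern in (0x00000000, 0x80000000, 0x3F800000, 0x00000001, 0xFF800000, 0x7FC00000):
    print(f"{pattern:08X} {classify(pattern, 8, 23)}")
```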