- Floating Point Format Cheat Sheet
- =================================
- S:E:I:M S + E + I + M = N
- -----------------------------------
- - S.ign bit stored: 1: present (almost all formats), 0: absent (specific hardware has unsigned floats)
- - E.xponent length: Usually >0 unless you want a sign-magnitude integer (which has a -0)
- - I.mplicit mantissa MSB: 0: implicit, not stored (almost all formats), 1: explicit, stored (8087 internal format)
- - M.antissa bits: Usually >0 unless all you want is powers of 2
- - N total number of stored bits (memory layout typically varies by endianness LE or BE)
- --- IEEE 754 formats
- binary16 = 1:5:0:10 ("f16", "float16", "half")
- binary32 = 1:8:0:23 ("f32", "float32", "float" (nowadays))
- binary64 = 1:11:0:52 ("f64", "float64", "double" (nowadays))
- binary128 = 1:15:0:112 ("f128", "float128", "quad"; "long double" on some platforms)
- --- Application-specific formats
- bfloat16 = 1:8:0:7
- Pixar PXR24 = 1:8:0:15
- AMD fp24 = 1:7:0:16
- Intel 8087 extended precision (80 bit) = 1:15:1:63
- --- Handy rules
- E = length of exponent in bits
- M = length of mantissa in bits
- I+M = number of stored mantissa bits (the explicit leading bit, if any, plus the M fraction bits)
- s = sign actual bit
- e = exponent actual bits (unsigned integer)
- m = mantissa actual bits (unsigned integer)
- bias = 2^(E - 1) - 1
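The two rules N = S + E + I + M and bias = 2^(E - 1) - 1 can be checked against the formats listed above. This is a minimal sketch; the FORMATS table and the helper names total_bits and bias are made up for illustration and assume E > 0.

```python
# Each format is (S, E, I, M), following the sheet's S:E:I:M notation.
FORMATS = {
    "binary16": (1, 5, 0, 10),
    "binary32": (1, 8, 0, 23),
    "binary64": (1, 11, 0, 52),
    "binary128": (1, 15, 0, 112),
    "bfloat16": (1, 8, 0, 7),
    "x87 extended": (1, 15, 1, 63),
}


def total_bits(fmt):
    """N = S + E + I + M, the number of stored bits."""
    return sum(fmt)


def bias(fmt):
    """bias = 2^(E - 1) - 1 (assumes E > 0)."""
    _, e_len, _, _ = fmt
    return (1 << (e_len - 1)) - 1


for name, fmt in FORMATS.items():
    print(f"{name:13} N = {total_bits(fmt):3}  bias = {bias(fmt)}")
```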
- --- Decoding to value
- Normal (implicit mantissa leading bit = 1, 0 < e < ~0):
- n = (-1.0)^s * (1.0 + (m / 2.0^M)) * 2.0^(e - bias)
- Denormal (implicit mantissa leading bit = 0, e = 0):
- d = (-1.0)^s * (m / 2.0^M) * 2.0^(1 - bias)
- Numbers (explicit leading bit, m includes it; denormals are stored directly, with e = 0 treated as e = 1)
- x = (-1.0)^s * (m / 2.0^M) * 2.0^(e - bias)
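The decoding rules can be tried out on raw bit patterns. The sketch below assumes the standard forms (fraction m / 2^M, denormal exponent 1 - bias) and E > 0; the function name decode is hypothetical. It returns a Python float, so it only works for values that fit in a binary64; very large binary128 or x87 exponents will overflow.

```python
def decode(bits, S, E, I, M):
    """Decode a raw bit pattern (an int) laid out as S:E:I:M into a float."""
    sig_len = I + M                          # stored mantissa bits
    m = bits & ((1 << sig_len) - 1)
    e = (bits >> sig_len) & ((1 << E) - 1)
    s = (bits >> (sig_len + E)) & 1 if S else 0
    bias = (1 << (E - 1)) - 1
    sign = -1.0 if s else 1.0

    if e == (1 << E) - 1:                    # exponent all ones: inf or NaN
        frac = m & ((1 << M) - 1)
        return sign * float("inf") if frac == 0 else float("nan")
    if I:                                    # explicit leading bit is part of m
        exp = (e if e else 1) - bias         # e = 0 uses the same scale as e = 1
        return sign * (m / 2.0**M) * 2.0**exp
    if e == 0:                               # denormal: leading bit is 0
        return sign * (m / 2.0**M) * 2.0**(1 - bias)
    return sign * (1.0 + m / 2.0**M) * 2.0**(e - bias)   # normal


print(decode(0x3F800000, 1, 8, 0, 23))       # binary32 pattern
print(decode(0xC000, 1, 5, 0, 10))           # binary16 pattern
```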
- --- Encoding to format
- Turn an arbitrarily long binary fixed-point string into a floating point number (truncating, no rounding):
- 0. Find the leading nonzero bit, and discard it (for implicit formats), or store it (for explicit)
- 1. Fill the mantissa, m, with the next M bits (or pad with zeros if not present)
- 2. Save the exponent, e, as the count of bits from the binary point to the leading nonzero bit, *and* then add the bias to it
- If e >= ~0, return +/-Infinity
- If e <= 0 && e > -M, return a denormal number with e = 0 and m = the partial bits that just barely make it
- If e <= -M, return +/-0.0
- Otherwise, it's a normal number
- Example: 175073.3333334513784317 in binary is
- 0000000000000000000101010101111100001.010101010101010101010111010100000101010101001010101001
- Encoded as binary64:
- e = 17 + 1023 = 1040 = 10000010000
- m = 0101010111110000101010101010101010101011101010000010
- s e m = 0 10000010000 0101010111110000101010101010101010101011101010000010
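The encoding steps can be sketched for implicit-leading-bit formats as follows. This is an illustration under stated assumptions, not a reference implementation: the name encode is hypothetical, excess bits are truncated rather than rounded, and the denormal range is taken as -M < e <= 0 for the biased exponent e.

```python
def encode(binstr, E, M, sign=0):
    """Encode a binary fixed-point string such as '1101.01' into (s, e, m)."""
    int_part, _, frac_part = binstr.partition(".")
    digits = int_part + frac_part
    lead = digits.find("1")                  # step 0: leading nonzero bit
    if lead < 0:
        return sign, 0, 0                    # no nonzero bit: +/-0.0
    bias = (1 << (E - 1)) - 1
    # Step 2: distance from the binary point to the leading bit, plus bias.
    e = (len(int_part) - 1 - lead) + bias
    tail = digits[lead + 1:]                 # bits after the leading 1
    if e >= (1 << E) - 1:
        return sign, (1 << E) - 1, 0         # overflow: +/-Infinity
    if e > 0:                                # normal: drop the leading 1
        return sign, e, int((tail + "0" * M)[:M], 2)
    if e > -M:                               # denormal: keep the 1, shift right
        shifted = "0" * (-e) + "1" + tail
        return sign, 0, int((shifted + "0" * M)[:M], 2)
    return sign, 0, 0                        # underflow: +/-0.0


# The sheet's example bit string, without its leading zeros.
example = "101010101111100001.010101010101010101010111010100000101010101001010101001"
s, e, m = encode(example, E=11, M=52)
print(s, format(e, "011b"), format(m, "052b"))
```

The printed fields can be compared by eye with the e and m lines of the worked example above.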
- --- Classifications
- class        condition                       s e       m        note
- +/-0.0       e = 0 && m = 0                  ? 000...0 000...0
- +0.0         s = 0 && e = 0 && m = 0         0 000...0 000...0
- -0.0         s = 1 && e = 0 && m = 0         1 000...0 000...0
- +/-normal    e != 0 && e != ~0               ? e'      ???...?  0 < e' < ~0
- +normal      s = 0 && e != 0 && e != ~0      0 e'      ???...?
- -normal      s = 1 && e != 0 && e != ~0      1 e'      ???...?
- +/-denormal  e = 0 && m != 0                 ? 000...0 m'       m' != 0
- +denormal    s = 0 && e = 0 && m != 0        0 000...0 m'
- -denormal    s = 1 && e = 0 && m != 0        1 000...0 m'
- +/-inf       e = ~0 && m = 0                 ? 111...1 000...0
- +inf         s = 0 && e = ~0 && m = 0        0 111...1 000...0
- -inf         s = 1 && e = ~0 && m = 0        1 111...1 000...0
- +/-s/qNaN    e = ~0 && m != 0                ? 111...1 r        r != 0
- +/-qNaN      e = ~0 && MSB(m) = 1            ? 111...1 1??...?
- +/-sNaN      e = ~0 && m != 0 && MSB(m) = 0  ? 111...1 0pp...p  p != 0
- (NaN MSB convention as in IEEE 754-2008; some older hardware reverses it)
- Notes:
- ------
- normal = regular numbers
- denormal = (a.k.a. subnormal) a value too small for a normal number: an underflow where precision is lost but some residual, non-zero value remains
- sNaN = signaling NaN: signals an invalid-operation exception when used; whether that becomes a program error depends on the environment
- qNaN = quiet NaN: flows silently through most operations
- && = logical (not bitwise) "and"
- ? = can be 0 or 1
- ~0 = all ones
- MSB() = most significant bit
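The classification conditions can be sketched as a small function over raw bit patterns. It assumes a 1:E:0:M layout (sign bit present, implicit leading bit) and the IEEE 754-2008 NaN convention, where a set mantissa MSB means quiet; the name classify is made up for illustration.

```python
def classify(bits, E, M):
    """Classify a raw bit pattern (an int) in a 1:E:0:M format."""
    m = bits & ((1 << M) - 1)
    e = (bits >> M) & ((1 << E) - 1)
    s = (bits >> (M + E)) & 1
    sign = "-" if s else "+"
    all_ones = (1 << E) - 1                  # the sheet's ~0

    if e == 0:
        return sign + ("0.0" if m == 0 else "denormal")
    if e != all_ones:
        return sign + "normal"
    if m == 0:
        return sign + "inf"
    # Assumed convention: MSB(m) = 1 is quiet, MSB(m) = 0 is signaling.
    quiet = (m >> (M - 1)) & 1
    return sign + ("qNaN" if quiet else "sNaN")


# A few binary32 patterns.
for pattern in (0x00000000, 0x80000000, 0x3F800000, 0x00000001,
                0x7F800000, 0x7FC00000, 0x7F800001):
    print(f"{pattern:08X} {classify(pattern, E=8, M=23)}")
```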