Floating Point Formats Cheat Sheet

a guest
Jun 19th, 2025
Floating Point Format Cheat Sheet
=================================

S:E:I:M        S + E + I + M = N
-----------------------------------

- S.ign bit stored: 1: present (almost all formats), 0: absent (some specialized hardware has unsigned floats)

- E.xponent length in bits: usually >0, unless you want a sign-magnitude integer (which has a -0)

- I.nteger bit (mantissa MSB) stored: 0: implicit, not stored (almost all formats), 1: explicit, stored (8087 internal format)

- M.antissa (fraction) bits: usually >0, unless all you want is powers of 2

- N: total number of stored bits (byte order in memory typically varies by endianness, LE or BE)
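
As a rough illustration of the S:E:I:M notation, the sketch below splits an N-bit pattern into its fields. The names `Fields` and `split_fields` are made up for this example, and the pattern is assumed to be given as a plain integer with the sign in the top bit.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    s: int  # sign bit (0 if the format stores no sign)
    e: int  # raw exponent bits, still biased
    m: int  # stored significand bits (I + M bits wide)


def split_fields(bits: int, S: int, E: int, I: int, M: int) -> Fields:
    """Split an N-bit pattern (N = S + E + I + M) into s, e and m."""
    sig_bits = I + M
    m = bits & ((1 << sig_bits) - 1)
    e = (bits >> sig_bits) & ((1 << E) - 1)
    s = (bits >> (sig_bits + E)) & 1 if S else 0
    return Fields(s, e, m)


# binary16 pattern 0x3C00 (1.0) and binary32 pattern 0xC0490FDB (about -pi)
print(split_fields(0x3C00, 1, 5, 0, 10))
print(split_fields(0xC0490FDB, 1, 8, 0, 23))
```

The raw exponent is returned as stored; removing the bias is left to the decoding step.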


--- IEEE 754 formats

binary16  = 1:5:0:10   ("f16", "float16", "half")

binary32  = 1:8:0:23   ("f32", "float32", "float" (nowadays))

binary64  = 1:11:0:52  ("f64", "float64", "double" (nowadays))

binary128 = 1:15:0:112 ("f128", "float128", "quad"; "long double" on some platforms; not the same as "double-double", which is a pair of binary64 values)


--- Application-specific formats

bfloat16 = 1:8:0:7

Pixar PXR24 = 1:8:0:15

AMD fp24 = 1:7:0:16

Intel 8087 extended precision (80 bit) = 1:15:1:63
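
A quick sanity check of the S:E:I:M entries listed so far: the fields should add up to the advertised width. The table name `FORMATS` and the helper `total_bits` are hypothetical, chosen only for this sketch.

```python
# S, E, I, M for each format named in this cheat sheet
FORMATS = {
    "binary16": (1, 5, 0, 10),
    "binary32": (1, 8, 0, 23),
    "binary64": (1, 11, 0, 52),
    "binary128": (1, 15, 0, 112),
    "bfloat16": (1, 8, 0, 7),
    "PXR24": (1, 8, 0, 15),
    "AMD fp24": (1, 7, 0, 16),
    "x87 extended": (1, 15, 1, 63),
}


def total_bits(S: int, E: int, I: int, M: int) -> int:
    """N = S + E + I + M."""
    return S + E + I + M


for name, spec in FORMATS.items():
    layout = ":".join(str(field) for field in spec)
    print(f"{name:13} {layout:11} N = {total_bits(*spec)}")
```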


--- Handy rules

E = length of exponent in bits
M = length of mantissa (fraction) in bits
M + 1 = precision in significant bits, counting the leading bit whether it is implicit (I = 0) or stored (I = 1)

s = actual sign bit
e = actual exponent bits (unsigned integer)
m = actual stored mantissa bits (unsigned integer, I + M bits wide)
bias = 2^(E - 1) - 1
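
The bias rule can be checked numerically. This is a small sketch with hypothetical helper names; `normal_exponent_range` assumes the usual IEEE-style convention that the all-zeros and all-ones exponents are reserved.

```python
def bias(E: int) -> int:
    """bias = 2^(E - 1) - 1"""
    return (1 << (E - 1)) - 1


def normal_exponent_range(E: int) -> tuple:
    """Unbiased exponent range of normal numbers (e = 1 .. all ones - 1)."""
    b = bias(E)
    return (1 - b, (1 << E) - 2 - b)


for E in (5, 8, 11, 15):
    print(E, bias(E), normal_exponent_range(E))
```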


--- Decoding to value

Normal (implicit mantissa leading bit = 1), e != 0 and e != ~0:
n = (-1)^s * (1 + m / 2^M) * 2^(e - bias)

Denormal (implicit mantissa leading bit = 0), e = 0:
d = (-1)^s * (m / 2^M) * 2^(1 - bias)

Numbers with an explicit leading bit (denormals are stored directly; m includes the stored leading bit; when the stored e is 0, use e = 1 in the formula):
x = (-1)^s * (m / 2^M) * 2^(e - bias)
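
The three decoding cases can be sketched with exact rational arithmetic. The function `decode` and its `explicit` flag are invented for this illustration. It assumes a finite number (e is not all ones), the usual IEEE-style denormal exponent of 1 - bias, and that m already contains the leading bit for explicit formats.

```python
import struct
from fractions import Fraction


def decode(s: int, e: int, m: int, E: int, M: int, explicit: bool = False) -> Fraction:
    """Exact value of a finite number from its raw fields."""
    bias = (1 << (E - 1)) - 1
    sign = -1 if s else 1
    if explicit:
        # Leading bit is part of m; a stored exponent of 0 behaves like 1.
        frac = Fraction(m, 1 << M)
        exp = max(e, 1) - bias
    elif e == 0:
        # Denormal: implicit leading bit is 0.
        frac = Fraction(m, 1 << M)
        exp = 1 - bias
    else:
        # Normal: implicit leading bit is 1.
        frac = 1 + Fraction(m, 1 << M)
        exp = e - bias
    return sign * frac * Fraction(2) ** exp


# Cross-check against the machine's own binary64 encoding of 0.1
bits = struct.unpack(">Q", struct.pack(">d", 0.1))[0]
s, e, m = bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)
print(decode(s, e, m, E=11, M=52) == Fraction(0.1))
```

Using `Fraction` keeps every intermediate result exact, so the comparison is not affected by rounding in the check itself.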


--- Encoding to format

Turn a binary fixed-point number with arbitrarily many bits into a floating-point number:


0. Find the leading nonzero bit, and discard it (for implicit formats), or store it (for explicit)
1. Fill the mantissa, m, with the next M bits (pad with zeros if there are not enough); dropping the remaining bits truncates, whereas IEEE 754 rounds to nearest even by default
2. Save the exponent, e, as the position of the leading nonzero bit relative to the binary point (0 for the ones place, negative to the right of the point), then add the bias to it

If e >= ~0, return +/-Infinity
If e <= 0 && e > -M, return a denormal number with stored exponent 0 and the mantissa (leading bit included) shifted right by 1 - e bits, keeping the partial bits that just barely make it
If e <= -M, return +/-0.0
Otherwise, it's a normal number
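
The encoding steps above can be sketched as follows for implicit-leading-bit formats. The function name `encode` is hypothetical, the input is assumed to be an exact `Fraction`, and extra bits are truncated rather than rounded, so results can differ from hardware by one unit in the last place.

```python
from fractions import Fraction


def encode(x: Fraction, E: int, M: int):
    """Return (s, e, m) for an implicit-leading-bit format, truncating extra bits."""
    all_ones = (1 << E) - 1
    bias = (1 << (E - 1)) - 1
    s = 1 if x < 0 else 0
    x = abs(x)
    if x == 0:
        return s, 0, 0
    # Step 0: position p of the leading nonzero bit, i.e. floor(log2(x)).
    p = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** p > x:
        p -= 1
    # Step 2: biased exponent.
    e = p + bias
    if e >= all_ones:
        return s, all_ones, 0  # +/-Infinity
    if e <= -M:
        return s, 0, 0  # too small even for a denormal: +/-0.0
    if e <= 0:
        # Denormal: scale by the fixed exponent 1 - bias and keep M bits.
        return s, 0, int(x * Fraction(2) ** (M + bias - 1))
    # Step 1: drop the leading bit, keep the next M bits.
    m = int((x / Fraction(2) ** p - 1) * (1 << M))
    return s, e, m


print(encode(Fraction(1), 11, 52))
print(encode(Fraction(-3), 5, 10))
print(encode(Fraction(1, 2 ** 1074), 11, 52))
```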

Example:

175073.3333334513784317
0000000000000000000101010101111100001.010101010101010101010111010100000101010101001010101001

binary64
e = 17 + 1023 = 10000010000
m = 0101010111110000101010101010101010101011101010000010
0 10000010000 0101010111110000101010101010101010101011101010000010
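
One way to double-check the worked example is to hand the 64-bit pattern to Python's `struct` module and see what value comes back. This is only a verification sketch. Because the mantissa above was filled by truncation, the decoded value may differ from the decimal input in the last place.

```python
import struct

sign = "0"
exponent = "10000010000"
mantissa = "0101010111110000101010101010101010101011101010000010"
bits = sign + exponent + mantissa

# Reinterpret the 64 bits as a big-endian binary64 value.
value = struct.unpack(">d", int(bits, 2).to_bytes(8, "big"))[0]

print(len(bits), int(exponent, 2) - 1023)
print(value)
```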


--- Classifications


class         condition                         s  e        m         where

+/-0.0        e = 0 && m = 0                    ?  000...0  000...0
+0.0          s = 0 && e = 0 && m = 0           0  000...0  000...0
-0.0          s = 1 && e = 0 && m = 0           1  000...0  000...0

+/-normal     e != 0 && e != ~0                 ?  e'       ???...?   0 < e' < ~0
+normal       s = 0 && e != 0 && e != ~0        0  e'       ???...?
-normal       s = 1 && e != 0 && e != ~0        1  e'       ???...?

+/-denormal   e = 0 && m != 0                   ?  000...0  m'        m' != 0
+denormal     s = 0 && e = 0 && m != 0          0  000...0  m'
-denormal     s = 1 && e = 0 && m != 0          1  000...0  m'

+/-inf        e = ~0 && m = 0                   ?  111...1  000...0
+inf          s = 0 && e = ~0 && m = 0          0  111...1  000...0
-inf          s = 1 && e = ~0 && m = 0          1  111...1  000...0

+/-s/qNaN     e = ~0 && m != 0                  ?  111...1  r         r != 0
+/-qNaN       e = ~0 && MSB(m) = 1              ?  111...1  1??...?
+/-sNaN       e = ~0 && m != 0 && MSB(m) = 0    ?  111...1  0qq...q   q != 0
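
The classification rules translate almost directly into code. The sketch below uses a made-up function name, `classify`, and assumes an implicit-leading-bit format with the common convention that a set mantissa MSB marks a quiet NaN.

```python
def classify(s: int, e: int, m: int, E: int, M: int) -> str:
    """Name the class of a raw (s, e, m) triple, e.g. '+normal' or '-inf'."""
    all_ones = (1 << E) - 1
    sign = "-" if s else "+"
    if e == 0:
        kind = "0.0" if m == 0 else "denormal"
    elif e == all_ones:
        if m == 0:
            kind = "inf"
        elif m >> (M - 1):  # MSB(m) = 1
            kind = "qNaN"
        else:  # MSB(m) = 0, m != 0
            kind = "sNaN"
    else:
        kind = "normal"
    return sign + kind


print(classify(0, 0, 0, 11, 52))
print(classify(1, 0x7FF, 0, 11, 52))
print(classify(0, 0x7FF, 1 << 51, 11, 52))
```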

Notes:
------
normal   = regular numbers
denormal = (also "subnormal") a value too small for the normal range, typically after a calculation underflow: precision is lost, but a residual non-zero value remains
sNaN     = signaling NaN: raises the invalid-operation exception when used in arithmetic; whether that becomes a program error depends on the environment
qNaN     = quiet NaN: propagates through most operations without raising an exception
&&       = logical (not bitwise) "and"
?        = can be 0 or 1
~0       = all ones
MSB()    = most significant bit
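
A few of these notes can be observed from Python, whose `float` is binary64 on typical platforms. This is an informal demonstration, not a conformance test; it assumes the interpreter does not flush denormals to zero.

```python
import math
import sys

smallest_normal = sys.float_info.min  # 2^-1022 for binary64
denormal = smallest_normal / 4        # below the normal range, still non-zero
tiniest = 5e-324                      # smallest positive denormal, 2^-1074
nan = float("nan")                    # a quiet NaN

print(0.0 < denormal < smallest_normal)
print(tiniest / 2 == 0.0)             # underflows all the way to zero
print(math.isnan(nan + 1.0))          # quiet NaN flows through arithmetic
print(nan == nan)                     # NaN compares unequal even to itself
```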