Untitled

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lecture 13: Floating point arithmetic\n",
    "\n",
    "In order to complete computations in finite space and bounded time, we replace the real numbers with a surrogate finite set $\\mathbb{F}$, the floating point numbers. (The \"floating point\" term originally differentiated it from \"fixed point\", which was an early alternative system based on absolute errors rather than relative errors.) Most scientific computing now conforms to the IEEE 754 double precision standard. We won't need to think about the details of this standard going forward, but it is useful to briefly explore what is really going on in the computer.  \n",
    "\n",
    "In double precision there are 64 binary bits used to represent the members of $\\mathbb{F}$. We can get them directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"0011111111110000000000000000000000000000000000000000000000000000\""
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "one = bits(1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These bits define three integers $s$, $e$, and $m$ used in the representation\n",
    "\n",
    "$$ x = (-1)^s \\cdot \\left( 1 + 2^{-52}m \\right) \\cdot 2^e.$$\n",
    "\n",
    "Here $s\\in\\{0,1\\}$ requires one bit, $e\\in\\{-1022,\\ldots,1023\\}$ requires 11 bits, and $m\\in\\{0,1,\\ldots,2^{52}-1\\}$ requires 52 bits. We can dissect a double precision number to see these parts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((\"0\",0),(\"01111111111\",0),(\"0000000000000000000000000000000000000000000000000000\",0))"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function ieee(x::Float64)\n",
    "    b = bits(x);\n",
    "    s = (b[1:1],parse(Int,b[1:1]));\n",
    "    e = (b[2:12],parse(Int,b[2:12],2)-1023);\n",
    "    f = (b[13:64],parse(Int,b[13:64],2));\n",
    "    return s,e,f\n",
    "end\n",
    "\n",
    "ieee(1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next-greater element of $\\mathbb{F}$ is $1+\\epsilon_M$, for machine epsilon $\\epsilon_M=2^{-52}$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "eps() = 2.220446049250313e-16\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "((\"0\",0),(\"01111111111\",0),(\"0000000000000000000000000000000000000000000000000001\",1))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@show eps()\n",
    "ieee(1.0+eps())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are $2^{52}$ elements of $\\mathbb{F}$ equally spaced throughout $[1,2)$. After these, the exponent increases to 1 and the value of $m$ resets to zero. Thus there are also $2^{52}$ elements equally spaced throughout $[2,4)$, as well as $[4,8)$, $[1/2,1)$, and in general, $[2^{e},2^{e+1})$. Let $\\text{fl}(x)$ be the element of $\\mathbb{F}$ nearest to any real number $x$. Consequently, if we momentarily ignore the bounds on the exponent $e$,\n",
    "\n",
    "$$ \\frac{\\bigl|x-\\text{fl}(x)\\bigr|}{\\bigl|x\\bigr|} \\le \\frac{1}{2} \\epsilon_M.$$\n",
    "\n",
    "In practice, the largest finite value in $\\mathbb{F}$ is"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R = 2.0 ^ 1023 * (1 + (2 ^ 52 - 1) / 2 ^ 52) = 1.7976931348623157e308\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "((\"0\",0),(\"11111111110\",1023),(\"1111111111111111111111111111111111111111111111111111\",4503599627370495))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@show R = (2.0^1023)*(1 + (2^52-1)/2^52);\n",
    "ieee(R)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results that should be larger than this become the special value Inf; this situation is called _overflow_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Inf"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nextfloat(R)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.2250738585072014e-308"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "2.0^(-1022)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The analogous situation near zero is called _underflow_. Ignoring some special \"denormalized\" numbers, the smallest positive element of $\\mathbb{F}$ is  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "r = 2.0 ^ -1022 = 2.2250738585072014e-308\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "((\"0\",0),(\"00000000001\",-1022),(\"0000000000000000000000000000000000000000000000000000\",0))"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@show r = 2.0^-1022;\n",
    "ieee(r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this minimum value is far smaller than $\\epsilon_M$, which is the number spacing relative to 1. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.5.0",
   "language": "julia",
   "name": "julia-0.5"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}