Untitled
a guest, Nov 29th, 2015
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "metadata": {
  6. "collapsed": false
  7. },
  8. "source": [
  9. "# Experimenting with Dask Imperatives and Distributed"
  10. ]
  11. },
  12. {
  13. "cell_type": "code",
  14. "execution_count": 2,
  15. "metadata": {
  16. "collapsed": true
  17. },
  18. "outputs": [],
  19. "source": [
  20. "import numpy as np"
  21. ]
  22. },
  23. {
  24. "cell_type": "markdown",
  25. "metadata": {},
  26. "source": [
  27. "To start a `distributed` cluster with 4 workers, I run:\n",
  28. "```\n",
  29. "dcenter & \\\n",
  30. " dworker 127.0.0.1:8787 & \\\n",
  31. " dworker 127.0.0.1:8787 & \\\n",
  32. " dworker 127.0.0.1:8787 & \\\n",
  33. " dworker 127.0.0.1:8787 &\n",
  34. "```\n",
  35. "\n",
  36. "and to stop it, the easiest way seems to be:\n",
  37. "```\n",
  38. "ps aux | grep python | grep dcenter | awk '{print $2}' | xargs kill\n",
  39. "ps aux | grep python | grep dworker | awk '{print $2}' | xargs kill\n",
  40. "```\n"
  41. ]
  42. },
  43. {
  44. "cell_type": "markdown",
  45. "metadata": {},
  46. "source": [
  47. "Having done this, we can create a `distributed` executor for this notebook."
  48. ]
  49. },
  50. {
  51. "cell_type": "code",
  52. "execution_count": 3,
  53. "metadata": {
  54. "collapsed": true
  55. },
  56. "outputs": [],
  57. "source": [
  58. "from distributed import Executor\n",
  59. "executor = Executor('127.0.0.1:8787')"
  60. ]
  61. },
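{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick connectivity check (a sketch, assuming the `concurrent.futures`-style `submit`/`result` interface that `Executor` exposes), we can round-trip a trivial task through the cluster:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Submit a trivial task and block on its result to confirm the\n",
"# executor can reach the workers.\n",
"future = executor.submit(lambda x: x + 1, 10)\n",
"future.result()  # should return 11"
]
},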
  62. {
  63. "cell_type": "markdown",
  64. "metadata": {},
  65. "source": [
  66. "## Create a dummy workload"
  67. ]
  68. },
  69. {
  70. "cell_type": "markdown",
  71. "metadata": {},
  72. "source": [
  73. "The following code simulates a distributed workload that runs a series of \"expensive\" operations to generate an ordered collection of images. The idea is that these operations are expensive enough and/or numerous enough to justify computing them on a cluster rather than on a single, powerful (shared memory) workstation. For fast visualization purposes, I then normalize these images, apply a colormap, and then pull them down to the master node where they can be saved as frames of a movie.\n",
  74. "\n",
  75. "By the time I colormap these images and fetch them down to the executor, each image contains roughly 2MB of data. In this example I generate just 100 of them, so there is only about 200MB of data to transfer from the workers. In a real application I might generate far more data than this, perhaps several gigabytes' worth of images to fetch from the cluster. Even in that \"scaled up\" scenario, each individual image would remain a manageable size (in the 1-10MB range)."
  76. ]
  77. },
  78. {
  79. "cell_type": "code",
  80. "execution_count": 27,
  81. "metadata": {
  82. "collapsed": true
  83. },
  84. "outputs": [],
  85. "source": [
  86. "IMAGE_SIZE = (512,512)\n",
  87. "NUM_Z_SLICES = 100\n",
  88. "Z_RANGE = (-10e-6, 10e-6)\n",
  89. "MOVIE_Z_SAMPLES = np.linspace(Z_RANGE[0], Z_RANGE[1], NUM_Z_SLICES)"
  90. ]
  91. },
  92. {
  93. "cell_type": "markdown",
  94. "metadata": {},
  95. "source": [
  96. "Define the functions that make up our workload. This is an extremely parallelizable workload that nonetheless has several dependencies between computational \"layers.\""
  97. ]
  98. },
  99. {
  100. "cell_type": "code",
  101. "execution_count": 28,
  102. "metadata": {
  103. "collapsed": true
  104. },
  105. "outputs": [],
  106. "source": [
  107. "def compute_test_image(z):\n",
  108. "    \"\"\"\n",
  109. "    Imagine that this is a function that takes a while to run on a cluster.\n",
  110. "\n",
  111. "    Returns an image of type complex64.\n",
  112. "    \"\"\"\n",
  113. "    re_data = np.random.random(IMAGE_SIZE)\n",
  114. "    im_data = np.random.random(IMAGE_SIZE)\n",
  115. "\n",
  116. "    result = re_data + 1j * im_data\n",
  117. "\n",
  118. "    # Do a dummy FFT 10 times, to simulate some real work\n",
  119. "    for i in range(10):\n",
  120. "        result = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(result)))\n",
  121. "\n",
  122. "    return result\n",
  123. "\n",
  124. "def amplitude_range(im):\n",
  125. "    amplitude = np.abs(im)\n",
  126. "    return (amplitude.min(), amplitude.max())\n",
  127. "\n",
  128. "def phase_range(im):\n",
  129. "    phase = np.angle(im)\n",
  130. "    return (phase.min(), phase.max())\n",
  131. "\n",
  132. "def colormap_and_combine(im, amplitude_range, phase_range):\n",
  133. "    amplitude = np.abs(im)\n",
  134. "    amplitude /= amplitude_range[1]\n",
  135. "\n",
  136. "    phase = np.angle(im)\n",
  137. "    phase = (phase - phase_range[0]) / (phase_range[1] - phase_range[0])\n",
  138. "\n",
  139. "    from matplotlib import cm\n",
  140. "    amplitude_cmap = cm.hot(amplitude)\n",
  141. "    phase_cmap = cm.coolwarm(phase)\n",
  142. "\n",
  143. "    from matplotlib.colors import rgb_to_hsv, hsv_to_rgb\n",
  144. "    hsv = rgb_to_hsv(phase_cmap[:,:,:3])\n",
  145. "    hsv[:,:,2] = np.power(amplitude, 0.5)\n",
  146. "    phase_cmap[:,:,:3] = hsv_to_rgb(hsv)\n",
  147. "\n",
  148. "    result = np.zeros((im.shape[0], 2 * im.shape[1], 4), dtype=np.float32)\n",
  149. "    result[:, :im.shape[1]] = amplitude_cmap\n",
  150. "    result[:, im.shape[1]:] = phase_cmap\n",
  151. "\n",
  152. "    result *= 255\n",
  153. "\n",
  154. "    return result.astype(np.uint8)\n"
  155. ]
  156. },
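{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before handing this workload to a scheduler, it is worth a quick in-process sanity check (not part of the timing tests below): run a single image through the full pipeline and confirm the output shape and dtype."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Local sanity check: one image through the full pipeline.\n",
"im = compute_test_image(0.0)\n",
"frame = colormap_and_combine(im, amplitude_range(im), phase_range(im))\n",
"frame.shape, frame.dtype  # expect (512, 1024, 4) and uint8"
]
},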
  157. {
  158. "cell_type": "markdown",
  159. "metadata": {},
  160. "source": [
  161. "Create a computational graph using Dask imperatives. (So cool!) This computation is deferred until I actually call `compute()` below. \n",
  162. "\n",
  163. "The description of the computation and the execution engine are nicely abstracted from each other in Dask, so I can run this same computation on a variety of execution engines below and compare their relative performance."
  164. ]
  165. },
  166. {
  167. "cell_type": "code",
  168. "execution_count": 29,
  169. "metadata": {
  170. "collapsed": false
  171. },
  172. "outputs": [],
  173. "source": [
  174. "from dask.imperative import do, value\n",
  175. "\n",
  176. "test_images = [ do(compute_test_image)(z) for z in MOVIE_Z_SAMPLES ]\n",
  177. "amplitude_ranges = [ do(amplitude_range)(im) for im in test_images ]\n",
  178. "phase_ranges = [ do(phase_range)(im) for im in test_images ]\n",
  179. "\n",
  180. "colormapped_images = value([ do(colormap_and_combine)(*args) for args in zip(test_images, amplitude_ranges, phase_ranges) ])"
  181. ]
  182. },
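{
"cell_type": "markdown",
"metadata": {},
"source": [
"The deferred graph can also be inspected before any computation runs. As a sketch (assuming graphviz is installed and that imperative values expose the same `visualize()` method as other dask collections):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Render the task graph to confirm the layered structure:\n",
"# compute_test_image -> amplitude/phase ranges -> colormap_and_combine\n",
"colormapped_images.visualize()"
]
},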
  183. {
  184. "cell_type": "markdown",
  185. "metadata": {},
  186. "source": [
  187. "### Tests Round #1: Fetching individual results"
  188. ]
  189. },
  190. {
  191. "cell_type": "markdown",
  192. "metadata": {},
  193. "source": [
  194. "These tests measure the time it takes to compute a single colormapped image."
  195. ]
  196. },
  197. {
  198. "cell_type": "markdown",
  199. "metadata": {},
  200. "source": [
  201. "#### Test #1.1: Run the computation locally using the `dask.threaded` execution engine."
  202. ]
  203. },
  204. {
  205. "cell_type": "code",
  206. "execution_count": 30,
  207. "metadata": {
  208. "collapsed": false
  209. },
  210. "outputs": [
  211. {
  212. "name": "stdout",
  213. "output_type": "stream",
  214. "text": [
  215. "CPU times: user 29 s, sys: 6.03 s, total: 35 s\n",
  216. "Wall time: 10.2 s\n"
  217. ]
  218. }
  219. ],
  220. "source": [
  221. "%%time\n",
  222. "import dask.threaded\n",
  223. "test1 = colormapped_images[0].compute(get = dask.threaded.get, num_workers = 4)"
  224. ]
  225. },
  226. {
  227. "cell_type": "markdown",
  228. "metadata": {},
  229. "source": [
  230. "**Comments:** This is the baseline case, where a single result is computed. The results are already in shared memory, so there is no communication overhead in this example. This gives a sense of how much time is spent computing each output image."
  231. ]
  232. },
  233. {
  234. "cell_type": "markdown",
  235. "metadata": {},
  236. "source": [
  237. "#### Test #1.2: Run the computation locally using the `dask.multiprocessing` execution engine."
  238. ]
  239. },
  240. {
  241. "cell_type": "code",
  242. "execution_count": 31,
  243. "metadata": {
  244. "collapsed": false
  245. },
  246. "outputs": [
  247. {
  248. "name": "stdout",
  249. "output_type": "stream",
  250. "text": [
  251. "CPU times: user 2.35 s, sys: 3.88 s, total: 6.24 s\n",
  252. "Wall time: 18.1 s\n"
  253. ]
  254. }
  255. ],
  256. "source": [
  257. "%%time\n",
  258. "import dask.multiprocessing\n",
  259. "test2 = colormapped_images[0].compute(get = dask.multiprocessing.get, num_workers = 4)"
  260. ]
  261. },
  262. {
  263. "cell_type": "markdown",
  264. "metadata": {},
  265. "source": [
  266. "**Comments:** This takes quite a lot longer. Some of the overhead must be due to the creation of a multiprocessing pool, and some is presumably due to inter-process communication as the results are fetched back to the master process. The overhead is significant in this example."
  267. ]
  268. },
  269. {
  270. "cell_type": "markdown",
  271. "metadata": {},
  272. "source": [
  273. "#### Test #1.3: Run the computation on a local cluster using the `distributed.executor.get` computation engine."
  274. ]
  275. },
  276. {
  277. "cell_type": "markdown",
  278. "metadata": {},
  279. "source": [
  280. "The first time we run this, the results take a little while to compute."
  281. ]
  282. },
  283. {
  284. "cell_type": "code",
  285. "execution_count": 32,
  286. "metadata": {
  287. "collapsed": false
  288. },
  289. "outputs": [
  290. {
  291. "name": "stdout",
  292. "output_type": "stream",
  293. "text": [
  294. "CPU times: user 895 ms, sys: 152 ms, total: 1.05 s\n",
  295. "Wall time: 20.7 s\n"
  296. ]
  297. }
  298. ],
  299. "source": [
  300. "%%time \n",
  301. "test3 = colormapped_images[0].compute(get = executor.get)"
  302. ]
  303. },
  304. {
  305. "cell_type": "markdown",
  306. "metadata": {},
  307. "source": [
  308. "**Comments:** The communication overhead is apparent here, and the wall time ends up comparable to the `dask.multiprocessing` test above. I imagine it helps that the distributed workers are already running, whereas `dask.multiprocessing` needs to spin up a worker pool first. Perhaps inter-process communication is a bit faster here as well?"
  309. ]
  310. },
  311. {
  312. "cell_type": "markdown",
  313. "metadata": {},
  314. "source": [
  315. "### Tests Round #2: Fetching all results"
  316. ]
  317. },
  318. {
  319. "cell_type": "markdown",
  320. "metadata": {},
  321. "source": [
  322. "#### Test #2.1: Run the computation locally using the `dask.threaded` execution engine."
  323. ]
  324. },
  325. {
  326. "cell_type": "code",
  327. "execution_count": 36,
  328. "metadata": {
  329. "collapsed": false
  330. },
  331. "outputs": [
  332. {
  333. "name": "stdout",
  334. "output_type": "stream",
  335. "text": [
  336. "CPU times: user 28.9 s, sys: 6.07 s, total: 35 s\n",
  337. "Wall time: 10.3 s\n"
  338. ]
  339. }
  340. ],
  341. "source": [
  342. "%%time\n",
  343. "import dask.threaded\n",
  344. "\n",
  345. "results = colormapped_images.compute(get = dask.threaded.get, num_workers=4);"
  346. ]
  347. },
  348. {
  349. "cell_type": "markdown",
  350. "metadata": {},
  351. "source": [
  352. "**Comments:** This test computation is small enough that I can run it locally on my machine to show the best case. It runs extremely quickly, since the computation is carried out in parallel across 4 cores on my laptop and the results already reside in shared memory, so essentially zero time is spent \"fetching\" them. This is the best case, with nearly no communication overhead."
  353. ]
  354. },
  355. {
  356. "cell_type": "markdown",
  357. "metadata": {},
  358. "source": [
  359. "#### Test #2.2: Run the computation locally using the `dask.multiprocessing` execution engine."
  360. ]
  361. },
  362. {
  363. "cell_type": "code",
  364. "execution_count": 37,
  365. "metadata": {
  366. "collapsed": false
  367. },
  368. "outputs": [
  369. {
  370. "name": "stdout",
  371. "output_type": "stream",
  372. "text": [
  373. "CPU times: user 2.72 s, sys: 25.9 s, total: 28.6 s\n",
  374. "Wall time: 56.8 s\n"
  375. ]
  376. }
  377. ],
  378. "source": [
  379. "%%time\n",
  380. "import dask.multiprocessing\n",
  381. "\n",
  382. "results = colormapped_images.compute(get = dask.multiprocessing.get, num_workers=4);"
  383. ]
  384. },
  385. {
  386. "cell_type": "markdown",
  387. "metadata": {},
  388. "source": [
  389. "**Comments:** In this test case, I see my CPU usage climb to about 300% for roughly 10 seconds, and then drop to 100% for the remainder of the run. Presumably the master process is collecting results from the workers during this second phase, using the full CPU resources available to it. I expect this data transfer to take some time, but it is quite a bit longer than I would expect it to take to transfer 200MB in 2MB chunks locally between processes."
  390. ]
  391. },
  392. {
  393. "cell_type": "markdown",
  394. "metadata": {},
  395. "source": [
  396. "#### Test #2.3: Run the computation on a local cluster using the `distributed.executor.get` computation engine."
  397. ]
  398. },
  399. {
  400. "cell_type": "code",
  401. "execution_count": 38,
  402. "metadata": {
  403. "collapsed": false
  404. },
  405. "outputs": [
  406. {
  407. "name": "stdout",
  408. "output_type": "stream",
  409. "text": [
  410. "CPU times: user 42.1 s, sys: 26.2 s, total: 1min 8s\n",
  411. "Wall time: 1min 27s\n"
  412. ]
  413. }
  414. ],
  415. "source": [
  416. "%%time\n",
  417. "results = colormapped_images.compute(get = executor.get);"
  418. ]
  419. },
  420. {
  421. "cell_type": "markdown",
  422. "metadata": {},
  423. "source": [
  424. "**Comments:** Again, CPU usage spikes initially for about 10s as results are computed (each of the four workers uses about 150% of a CPU core), and then drops to 100% as a single Python process (I assume the executor) gathers results.\n",
  425. "\n",
  426. "The communication overhead is greater here than in the `dask.multiprocessing` test above. However, when I run the same test with 50 images, this test actually collects results faster than the `dask.multiprocessing` version. Strange. Either way, the communication overhead is more than I would expect for transferring 200MB worth of data locally on a single machine. Compression would speed this up, of course, but even so the data transfer seems slow."
  427. ]
  428. },
  429. {
  430. "cell_type": "code",
  431. "execution_count": null,
  432. "metadata": {
  433. "collapsed": true
  434. },
  435. "outputs": [],
  436. "source": []
  437. }
  438. ],
  439. "metadata": {
  440. "kernelspec": {
  441. "display_name": "Python 3",
  442. "language": "python",
  443. "name": "python3"
  444. },
  445. "language_info": {
  446. "codemirror_mode": {
  447. "name": "ipython",
  448. "version": 3
  449. },
  450. "file_extension": ".py",
  451. "mimetype": "text/x-python",
  452. "name": "python",
  453. "nbconvert_exporter": "python",
  454. "pygments_lexer": "ipython3",
  455. "version": "3.5.0"
  456. }
  457. },
  458. "nbformat": 4,
  459. "nbformat_minor": 0
  460. }