{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Knet RNN example"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# After installing and starting Julia, run the following to install the required packages:\n",
"# Pkg.init(); Pkg.update()\n",
"# for p in (\"CUDAdrv\",\"IJulia\",\"PyCall\",\"JLD2\",\"Knet\"); Pkg.add(p); end\n",
"# Pkg.checkout(\"Knet\",\"ilkarman\") # make sure we have the right Knet version\n",
"# Pkg.build(\"Knet\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"using Knet\n",
"True=true # so we can read the python params\n",
"include(\"common/params_lstm.py\");"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OS: Linux\n",
"Julia: 0.6.2\n",
"Knet: 0.9.0+\n",
"GPU: Tesla K80\n",
"\n"
]
}
],
"source": [
"println(\"OS: \", Sys.KERNEL)\n",
"println(\"Julia: \", VERSION)\n",
"println(\"Knet: \", Pkg.installed(\"Knet\"))\n",
"println(\"GPU: \", readstring(`nvidia-smi --query-gpu=name --format=csv,noheader`))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"```\n",
"rnninit(inputSize, hiddenSize; opts...)\n",
"```\n",
"\n",
"Return an `(r,w)` pair where `r` is an RNN struct and `w` is a single weight array that includes all matrices and biases for the RNN. Keyword arguments:\n",
"\n",
" * `rnnType=:lstm`: Type of RNN: One of :relu, :tanh, :lstm, :gru.\n",
" * `numLayers=1`: Number of RNN layers.\n",
" * `bidirectional=false`: Create a bidirectional RNN if `true`.\n",
" * `dropout=0.0`: Dropout probability. Ignored if `numLayers==1`.\n",
" * `skipInput=false`: Do not multiply the input with a matrix if `true`.\n",
" * `dataType=Float32`: Data type to use for weights.\n",
" * `algo=0`: Algorithm to use, see CUDNN docs for details.\n",
" * `seed=0`: Random number seed. Uses `time()` if 0.\n",
" * `winit=xavier`: Weight initialization method for matrices.\n",
" * `binit=zeros`: Weight initialization method for bias vectors.\n",
" * `usegpu=(gpu()>=0)`: GPU used by default if one exists.\n",
"\n",
"RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:\n",
"\n",
"`:relu` and `:tanh`: Single gate RNN with activation function f:\n",
"\n",
"```\n",
"h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)\n",
"```\n",
"\n",
"`:gru`: Gated recurrent unit:\n",
"\n",
"```\n",
"i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate\n",
"r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate\n",
"n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate\n",
"h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]\n",
"```\n",
"\n",
"`:lstm`: Long short term memory unit with no peephole connections:\n",
"\n",
"```\n",
"i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate\n",
"f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate\n",
"o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate\n",
"n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate\n",
"c[t] = f[t] .* c[t-1] .+ i[t] .* n[t] # cell output\n",
"h[t] = o[t] .* tanh(c[t])\n",
"```\n"
],
"text/plain": [
"```\n",
"rnninit(inputSize, hiddenSize; opts...)\n",
"```\n",
"\n",
"Return an `(r,w)` pair where `r` is an RNN struct and `w` is a single weight array that includes all matrices and biases for the RNN. Keyword arguments:\n",
"\n",
" * `rnnType=:lstm`: Type of RNN: One of :relu, :tanh, :lstm, :gru.\n",
" * `numLayers=1`: Number of RNN layers.\n",
" * `bidirectional=false`: Create a bidirectional RNN if `true`.\n",
" * `dropout=0.0`: Dropout probability. Ignored if `numLayers==1`.\n",
" * `skipInput=false`: Do not multiply the input with a matrix if `true`.\n",
" * `dataType=Float32`: Data type to use for weights.\n",
" * `algo=0`: Algorithm to use, see CUDNN docs for details.\n",
" * `seed=0`: Random number seed. Uses `time()` if 0.\n",
" * `winit=xavier`: Weight initialization method for matrices.\n",
" * `binit=zeros`: Weight initialization method for bias vectors.\n",
" * `usegpu=(gpu()>=0)`: GPU used by default if one exists.\n",
"\n",
"RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:\n",
"\n",
"`:relu` and `:tanh`: Single gate RNN with activation function f:\n",
"\n",
"```\n",
"h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)\n",
"```\n",
"\n",
"`:gru`: Gated recurrent unit:\n",
"\n",
"```\n",
"i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate\n",
"r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate\n",
"n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate\n",
"h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]\n",
"```\n",
"\n",
"`:lstm`: Long short term memory unit with no peephole connections:\n",
"\n",
"```\n",
"i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate\n",
"f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate\n",
"o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate\n",
"n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate\n",
"c[t] = f[t] .* c[t-1] .+ i[t] .* n[t] # cell output\n",
"h[t] = o[t] .* tanh(c[t])\n",
"```\n"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@doc rnninit"
]
},
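{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of `rnninit`, with made-up sizes (inputSize=10, hiddenSize=20): it builds a 1-layer GRU and shows that all gate matrices and biases come back packed in a single weight array."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: build a small GRU and inspect its packed weights.\n",
"r,w = rnninit(10, 20; rnnType=:gru)  # r: RNN spec, w: one array packing Wi,Wr,Wn,Ri,Rr,Rn and biases\n",
"println(summary(w))"
]
},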
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# define model\n",
"function initmodel()\n",
" rnnSpec,rnnWeights = rnninit(EMBEDSIZE,NUMHIDDEN; rnnType=:gru)\n",
" inputMatrix = KnetArray(xavier(Float32,EMBEDSIZE,MAXFEATURES))\n",
" outputMatrix = KnetArray(xavier(Float32,2,NUMHIDDEN))\n",
" return rnnSpec,(rnnWeights,inputMatrix,outputMatrix)\n",
"end;"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"```\n",
"rnnforw(r, w, x[, hx, cx]; batchSizes, hy, cy)\n",
"```\n",
"\n",
"Returns a tuple (y,hyout,cyout,rs) given rnn `r`, weights `w`, input `x` and optionally the initial hidden and cell states `hx` and `cx` (`cx` is only used in LSTMs). `r` and `w` should come from a previous call to `rnninit`. Both `hx` and `cx` are optional, they are treated as zero arrays if not provided. The output `y` contains the hidden states of the final layer for each time step, `hyout` and `cyout` give the final hidden and cell states for all layers, `rs` is a buffer the RNN needs for its gradient calculation.\n",
"\n",
"The boolean keyword arguments `hy` and `cy` control whether `hyout` and `cyout` will be output. By default `hy = (hx!=nothing)` and `cy = (cx!=nothing && r.mode==2)`, i.e. a hidden state will be output if one is provided as input and for cell state we also require an LSTM. If `hy`/`cy` is `false`, `hyout`/`cyout` will be `nothing`. `batchSizes` can be an integer array that specifies non-uniform batch sizes as explained below. By default `batchSizes=nothing` and the same batch size, `size(x,2)`, is used for all time steps.\n",
"\n",
"The input and output dimensions are:\n",
"\n",
" * `x`: (X,[B,T])\n",
" * `y`: (H/2H,[B,T])\n",
" * `hx`,`cx`,`hyout`,`cyout`: (H,B,L/2L)\n",
" * `batchSizes`: `nothing` or `Vector{Int}(T)`\n",
"\n",
"where X is inputSize, H is hiddenSize, B is batchSize, T is seqLength, L is numLayers. `x` can be 1, 2, or 3 dimensional. If `batchSizes==nothing`, a 1-D `x` represents a single instance, a 2-D `x` represents a single minibatch, and a 3-D `x` represents a sequence of identically sized minibatches. If `batchSizes` is an array of (non-increasing) integers, it gives us the batch size for each time step in the sequence, in which case `sum(batchSizes)` should equal `div(length(x),size(x,1))`. `y` has the same dimensionality as `x`, differing only in its first dimension, which is H if the RNN is unidirectional, 2H if bidirectional. Hidden vectors `hx`, `cx`, `hyout`, `cyout` all have size (H,B1,L) for unidirectional RNNs, and (H,B1,2L) for bidirectional RNNs where B1 is the size of the first minibatch.\n"
],
"text/plain": [
"```\n",
"rnnforw(r, w, x[, hx, cx]; batchSizes, hy, cy)\n",
"```\n",
"\n",
"Returns a tuple (y,hyout,cyout,rs) given rnn `r`, weights `w`, input `x` and optionally the initial hidden and cell states `hx` and `cx` (`cx` is only used in LSTMs). `r` and `w` should come from a previous call to `rnninit`. Both `hx` and `cx` are optional, they are treated as zero arrays if not provided. The output `y` contains the hidden states of the final layer for each time step, `hyout` and `cyout` give the final hidden and cell states for all layers, `rs` is a buffer the RNN needs for its gradient calculation.\n",
"\n",
"The boolean keyword arguments `hy` and `cy` control whether `hyout` and `cyout` will be output. By default `hy = (hx!=nothing)` and `cy = (cx!=nothing && r.mode==2)`, i.e. a hidden state will be output if one is provided as input and for cell state we also require an LSTM. If `hy`/`cy` is `false`, `hyout`/`cyout` will be `nothing`. `batchSizes` can be an integer array that specifies non-uniform batch sizes as explained below. By default `batchSizes=nothing` and the same batch size, `size(x,2)`, is used for all time steps.\n",
"\n",
"The input and output dimensions are:\n",
"\n",
" * `x`: (X,[B,T])\n",
" * `y`: (H/2H,[B,T])\n",
" * `hx`,`cx`,`hyout`,`cyout`: (H,B,L/2L)\n",
" * `batchSizes`: `nothing` or `Vector{Int}(T)`\n",
"\n",
"where X is inputSize, H is hiddenSize, B is batchSize, T is seqLength, L is numLayers. `x` can be 1, 2, or 3 dimensional. If `batchSizes==nothing`, a 1-D `x` represents a single instance, a 2-D `x` represents a single minibatch, and a 3-D `x` represents a sequence of identically sized minibatches. If `batchSizes` is an array of (non-increasing) integers, it gives us the batch size for each time step in the sequence, in which case `sum(batchSizes)` should equal `div(length(x),size(x,1))`. `y` has the same dimensionality as `x`, differing only in its first dimension, which is H if the RNN is unidirectional, 2H if bidirectional. Hidden vectors `hx`, `cx`, `hyout`, `cyout` all have size (H,B1,L) for unidirectional RNNs, and (H,B1,2L) for bidirectional RNNs where B1 is the size of the first minibatch.\n"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@doc rnnforw"
]
},
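{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of `rnnforw` on the small GRU above, with made-up sizes X=10, B=4, T=5: `y` comes back as (H,B,T), and `hyout`/`cyout` are `nothing` because no initial states are passed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: forward the small GRU over a random 5-step sequence.\n",
"x = KnetArray(randn(Float32,10,4,5))  # (X,B,T)\n",
"y,hyout,cyout,rs = rnnforw(r, w, x)   # y is (H,B,T) = (20,4,5)\n",
"println(summary(y))"
]
},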
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# define loss and its gradient\n",
"function predict(weights, inputs, rnnSpec)\n",
" rnnWeights, inputMatrix, outputMatrix = weights # (1,1,W), (X,V), (2,H)\n",
" indices = hcat(inputs...)' # (B,T)\n",
" rnnInput = inputMatrix[:,indices] # (X,B,T)\n",
" rnnOutput = rnnforw(rnnSpec, rnnWeights, rnnInput)[1] # (H,B,T)\n",
" return outputMatrix * rnnOutput[:,:,end] # (2,H) * (H,B) = (2,B)\n",
"end\n",
"\n",
"loss(w,x,y,r)=nll(predict(w,x,r),y)\n",
"lossgradient = grad(loss);"
]
},
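{
"cell_type": "markdown",
"metadata": {},
"source": [
"How one training step uses these pieces (a commented sketch; `weights`, `rnnSpec` and `optim` are only defined a few cells below): `lossgradient` takes the same arguments as `loss` but returns gradients with respect to the first argument, in the same nested structure as `weights`, which `update!` then applies using the Adam state in `optim`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical single step on one minibatch (x,y), runnable once the cells below have executed:\n",
"# J = loss(weights, x, y, rnnSpec)             # scalar negative log likelihood via nll\n",
"# grads = lossgradient(weights, x, y, rnnSpec) # same nested structure as weights\n",
"# update!(weights, grads, optim)               # one Adam step"
]
},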
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mLoading IMDB...\n",
"\u001b[39m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" 10.383756 seconds (15.84 M allocations: 830.528 MiB, 4.02% gc time)\n",
"25000-element Array{Array{Int32,1},1}\n",
"25000-element Array{Int8,1}\n",
"25000-element Array{Array{Int32,1},1}\n",
"25000-element Array{Int8,1}\n"
]
}
],
"source": [
"# load data\n",
"include(Knet.dir(\"data\",\"imdb.jl\"))\n",
"@time (xtrn,ytrn,xtst,ytst,imdbdict)=imdb(maxlen=MAXLEN,maxval=MAXFEATURES)\n",
"for d in (xtrn,ytrn,xtst,ytst); println(summary(d)); end"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"150-element Array{String,1}:\n",
" \"sharp\" \n",
" \"engrossing\" \n",
" \"and\" \n",
" \"perceptive\" \n",
" \"examination\"\n",
" \"of\" \n",
" \"suburban\" \n",
" \"angst\" \n",
" \"and\" \n",
" \"the\" \n",
" \"limitations\"\n",
" \"of\" \n",
" \"the\" \n",
" ⋮ \n",
" \"both\" \n",
" \"on\" \n",
" \"the\" \n",
" \"money\" \n",
" \"solid\" \n",
" \"and\" \n",
" \"effective\" \n",
" \"recommended\"\n",
" \"viewing\" \n",
" \"for\" \n",
" \"sarno\" \n",
" \"fans\" "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"imdbarray = Array{String}(88584)\n",
"for (k,v) in imdbdict; imdbarray[v]=k; end\n",
"imdbarray[xtrn[1]]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# prepare for training\n",
"weights = nothing; knetgc(); # Reclaim memory from previous run\n",
"rnnSpec,weights = initmodel()\n",
"optim = optimizers(weights, Adam; lr=LR, beta1=BETA_1, beta2=BETA_2, eps=EPS);"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 16.016910 seconds (2.05 M allocations: 137.400 MiB, 3.30% gc time)\n"
]
}
],
"source": [
"# cold start\n",
"@time for (x,y) in minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)\n",
" grads = lossgradient(weights,x,y,rnnSpec)\n",
" update!(weights, grads, optim)\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# prepare for training\n",
"weights = nothing; knetgc(); # Reclaim memory from previous run\n",
"rnnSpec,weights = initmodel()\n",
"optim = optimizers(weights, Adam; lr=LR, beta1=BETA_1, beta2=BETA_2, eps=EPS);"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mTraining...\n",
"\u001b[39m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" 10.263796 seconds (358.68 k allocations: 45.038 MiB, 4.63% gc time)\n",
" 9.550875 seconds (354.17 k allocations: 44.687 MiB, 6.23% gc time)\n",
" 9.575668 seconds (354.89 k allocations: 44.699 MiB, 6.32% gc time)\n",
" 29.397045 seconds (1.07 M allocations: 134.575 MiB, 5.70% gc time)\n"
]
}
],
"source": [
"# 29s\n",
"info(\"Training...\")\n",
"@time for epoch in 1:EPOCHS\n",
" @time for (x,y) in minibatch(xtrn,ytrn,BATCHSIZE;shuffle=true)\n",
" grads = lossgradient(weights,x,y,rnnSpec)\n",
" update!(weights, grads, optim)\n",
" end\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mTesting...\n",
"\u001b[39m"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" 3.780345 seconds (737.73 k allocations: 70.577 MiB, 3.82% gc time)\n"
]
},
{
"data": {
"text/plain": [
"0.8530448717948718"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"info(\"Testing...\")\n",
"@time accuracy(weights, minibatch(xtst,ytst,BATCHSIZE), (w,x)->predict(w,x,rnnSpec))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 0.6.2",
"language": "julia",
"name": "julia-0.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}