# How I think Algodiff Works

To train a network, we first need to call `train_generic`:

```ocaml
let train_generic ?state ?params ?(init_model=true) nn x y =
    if init_model = true then init nn;
    let f = forward nn in
    let b = backward nn in
    let u = update nn in
    let s = save nn in
    let p = match params with
      | Some p -> p
      | None   -> Optimise.Params.default ()
    in
    Optimise.minimise_network ?state p f b u s x y
```
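
For context, here is a hedged sketch of how `train_generic` is usually reached from user code, through Owl's `Graph.train`. The layer shapes and activation constructors below are illustrative and may differ slightly between Owl versions.

```ocaml
open Owl
open Neural.S
open Neural.S.Graph

(* a small illustrative network; the exact layers don't matter here *)
let make_network () =
  input [|28; 28; 1|]
  |> conv2d [|5; 5; 1; 32|] [|1; 1|] ~act_typ:Activation.Relu
  |> max_pool2d [|2; 2|] [|2; 2|]
  |> fully_connected 1024 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.(Softmax 1)
  |> get_network

let _ =
  let x, _, y = Dataset.load_mnist_train_data_arr () in
  let nn = make_network () in
  (* Graph.train wraps train_generic, which in turn calls Optimise.minimise_network *)
  train nn x y
```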

We can see that `train_generic` is actually a wrapper around `minimise_network`:

```ocaml
let minimise_network ?state params forward backward update save x y =
    ...
    let iterate i =
      let xt, yt = bach_fun x y i in
      let yt', ws = forward xt in
      let loss = loss_fun yt yt' in
      let loss = Maths.(loss / (_f (Mat.row_num yt |> float_of_int))) in
      let reg = ...
      let loss = Maths.(loss + reg) in
      let ws, gs' = backward loss in
      loss |> primal', ws, gs'
    in

    ...

    while Checkpoint.(state.stop = false) do
      let loss', ws, gs' = iterate Checkpoint.(state.current_batch) in
      ...
    done
```

What are the `forward` and `backward` functions here?

```ocaml
let forward nn x = mktag (tag ()) nn; run x nn, mkpar nn

let backward nn y = reverse_prop (_f 1.) y; mkpri nn, mkadj nn
```

These two operations are called at each iteration. Now we can see the process from a high level.

## Forward Phase

First, let's distinguish two kinds of neurons: the ones that contain weights to update, such as Conv2D, which I call **Type A** neurons; and the ones that only do calculation, such as MaxPool2D, which I call **Type B** neurons. Hereafter I'll use "layer" and "neuron" interchangeably.
Each neuron normally contains several *nodes*; each node has type `t`:

```ocaml
type t =
  | F   of A.elt                                  (* constructor of float numbers *)
  | Arr of A.arr                                  (* constructor of ndarrays *)
  | DF  of t * t * int                            (* primal, tangent, tag *)
  | DR  of t * t ref * trace_op * int ref * int   (* primal, adjoint, op, fanout, tag *)
```
For example, if the `run` function of one neuron is `Maths.(relu (x - _f a))`, it will generate 3 nodes. I don't think we need the `DF` value here; hereafter I understand the primal value of a `DR` as the value of the operation itself, and its adjoint value as the gradient.

(*wait, what is `primal` anyway? weight? output value? The graph is definitely held in `op`, not here...*)
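
For concreteness, here is a minimal sketch of a small expression creating nodes of type `t`, assuming Owl's `Algodiff.S` module exposes `pack_arr`, `pack_flt`, `make_reverse`, `tag` and `type_info` (the values below are illustrative).

```ocaml
open Owl
module AD = Algodiff.S

let _ =
  (* wrap an ndarray as a DR placeholder, just as make_reverse does for a weight *)
  let x = AD.make_reverse (AD.pack_arr (Dense.Ndarray.S.ones [|3; 3|])) (AD.tag ()) in
  (* relu (x - 0.4) creates three nodes: the constant, the subtraction DR and the relu DR *)
  let y = AD.Maths.(relu (x - AD.pack_flt 0.4)) in
  (* the result is a DR whose op field records the graph built so far *)
  print_endline (AD.type_info y)
```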


1. `mktag (tag ()) nn`: for each layer in the neural network, if it is Type A, change each of its parameters to a `DR` value by calling `make_reverse`; if it is a Type B neuron, do nothing.

```ocaml
let make_reverse p i = DR (p, ref (zero p), Noop, ref 0, i)
```

 Note that the `DR` values created here are ALL `Noop` operations, which means they are nothing more than placeholders at this stage.

2. `run x nn`: connect all the existing operations into a graph by running this network layer by layer, regardless of whether a neuron is Type A or B. The whole graph is accumulated onto the output node. Note that the `run` function of each neuron uses operations from `Algodiff.Maths` rather than plain math operations. Let's look at an example:

```ocaml
(* module Conv2D *)
let run x l = Maths.((conv2d ~padding:l.padding x l.w l.stride) + l.b)
```

Here both `l.w` and `l.b` are already set to `DR` placeholders, and `x` is a `t` output value from the previous neuron. How is the `conv2d` operation implemented in Algodiff then?

```ocaml
and conv2d ?padding a b s =
  let ff a b =
    match a, b with
    | Arr a, Arr b -> Arr A.(conv2d ?padding a b s)
    | _            -> error_binop "conv2d" a b
  in
  let fd a b = conv2d ?padding a b s in
  ...
  let r_d_c a b = Conv2D_D_C (a, b, s) in
  op_d_d_d a b ff fd _ _ _ _ r_d_c _
```
Here `a` and `b` are the input and the kernel respectively, both of type `t`. For simplicity, we only look at the case where the input is a `DR` and the kernel is a constant `Arr`:
(*One question: wait, you just said that `l.w` is already set to `DR` in the previous step; how could it be `Arr` now? What has changed it?*)

```ocaml
and op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d =
  match a, b with
  | DR (ap, _, _, _, ai), Arr _bp -> let cp = fd ap b in DR (cp, ref (zero cp), r_d_c a b, ref 0, ai)
  | ...
```

So what `Maths.conv2d` does is this: it computes the result value from the primal of the input `DR` (`let cp = fd ap b` => `let cp = conv2d ?padding ap b s`), initialises the adjoint to zero (`ref (zero cp)`), replaces the `Noop` operation with `Conv2D_D_C` (`r_d_c a b`), and keeps the tag as is (`ai`), producing a new `DR` node.

*What I don't quite understand is the function `fd`; it calls itself, why? I temporarily interpret it as "calculating the result value", but this may well be wrong.*
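
To see the same pattern outside of `conv2d`, here is a minimal sketch using a simpler unary operation (assuming `Algodiff.S` also exposes `primal`, `adjval` and `unpack_flt`): applying an `Algodiff.Maths` op to a `DR` yields a new `DR` whose primal is already computed and whose adjoint starts at zero.

```ocaml
open Owl
module AD = Algodiff.S

let _ =
  let x = AD.make_reverse (AD.pack_flt 1.5) (AD.tag ()) in
  let y = AD.Maths.sin x in
  Printf.printf "primal = %g, adjoint = %g\n"
    (AD.unpack_flt (AD.primal y))   (* sin 1.5, already computed on the forward pass *)
    (AD.unpack_flt (AD.adjval y))   (* 0., i.e. ref (zero cp) *)
```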

The translation of a Type B neuron is similar. For example, for the `MaxPool2D` neuron:

```ocaml
let run x l = Maths.(max_pool2d l.padding x l.kernel l.stride)

and max_pool2d padding a b s =
  let ff = function
    | Arr a    -> Arr A.(max_pool2d ~padding a b s)
    | _        -> error_uniop "max_pool2d" a
  in
  let fd a = max_pool2d padding a b s in
  let df _cp _ap _at = failwith "max_pool2d:df" in
  let r a = Maxpool2D_D (a, padding, b, s) in
  op_d_d a ff fd df r

and op_d_d a ff fd df r =
  match a with
  | DF (ap, at, ai)      -> ...
  | DR (ap, _, _, _, ai) -> let cp = fd ap in DR (cp, ref (zero cp), r a, ref 0, ai)
  | ap                   -> ff ap
```

If the input is a `DR`, this operation similarly adds a `DR` node to the graph; otherwise `ff` is called and an `Arr` node is added.
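
This dispatch can be observed from outside. Here is a minimal sketch, assuming `relu` goes through the same `op_d_d` helper (an assumption on my part) and using `type_info` to show the kind of node returned:

```ocaml
open Owl
module AD = Algodiff.S

let _ =
  let a = AD.pack_arr (Dense.Ndarray.S.uniform [|4; 4|]) in
  let d = AD.make_reverse a (AD.tag ()) in
  (* plain Arr input -> the ff path -> an Arr result *)
  print_endline (AD.type_info (AD.Maths.relu a));
  (* DR input -> a new DR with the operation recorded *)
  print_endline (AD.type_info (AD.Maths.relu d))
```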


After a forward pass is finished, we get one output `DR` value. But it's much more than an output value; it actually contains the whole computation graph in its `op`:

```ocaml
and trace_op =
  | Conv1D_D_C  of t * t * int array
  | Maxpool2D_D of t * padding * int array * int array
  ...
```
The output of each `run` function is accumulated in this way into a graph of `t`s.

3. The final step, `mkpar nn`, is simple: return the parameters of each layer in an array, which is of type `t array array`.


## Backward Phase

We already have the graph in the output `DR` value `y` from the forward pass; now let's apply the backward step to it. Note that the backward step is actually applied to a `loss` value, which appends some extra nodes to the end of the output graph from the forward pass, but let's ignore that for now.

We start from `reverse_prop (_f 1.) y`, which simply consists of two steps:

```ocaml
let reverse_prop v x =
  reverse_reset x;
  reverse_push v x

let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t ->
      (match x with
       | DR (_ap, aa, ao, af, _ai) -> (
           aa := reset_zero !aa;
           af := !af + 1;            (* fanout bookkeeping *)
           if !af = 1 then (
             match ao with
             | Noop -> reset t
             | ....
           )
           else reset t
         )
       | _ -> reset t)
  in
  reset [x]

let reverse_push v x =
  let open Maths in
  let rec push xs =
    match xs with
    | []          -> ()
    | (v, x) :: t -> (
        match x with
        | DR (ap, aa, ao, af, _ai) -> (
            aa := Maths.(!aa + v);
            match ao with
            | Noop -> push t
            | Conv2D_D_C (a, b, s)  -> push ((conv2d_backward_input a b s !aa, a) :: t)
            | ....
          )
        | _ -> push t
      )
  in
  push [(v, x)]
```

1. There is no magic in the `reverse_reset` function. Starting from the root node `y`, for each node: 1) set its gradient to 0; 2) push its parents onto the stack; 3) process the head of the stack until the stack is empty.

2. `reverse_push` is a little more complex but similar. Starting from `(v, y)`, where the `v` for `y` is 1, for each node: 1) update its gradient by adding `v` to its current gradient; 2) calculate the `v` for each of its parents; 3) push `(v, parent)` onto the stack; 4) process the head of the stack until the stack is empty (see the sketch after this list).
  In both steps, if the node is not a `DR`, just ignore it.

3. After one backward step, the gradient of each node is updated.
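
Putting the two steps together, here is a minimal sketch of one forward plus backward pass on a toy scalar expression, assuming `Algodiff.S` exposes `reverse_prop` and `adjval` (the expression and values are illustrative):

```ocaml
open Owl
module AD = Algodiff.S

let _ =
  (* the "weight": a DR placeholder, as mktag/make_reverse would create *)
  let w = AD.make_reverse (AD.pack_flt 3.) (AD.tag ()) in
  (* forward: build the graph for y = w * w + 2 * w *)
  let y = AD.Maths.(w * w + AD.pack_flt 2. * w) in
  (* backward: reset the adjoints, then push a seed of 1 from the output *)
  AD.reverse_prop (AD.pack_flt 1.) y;
  (* the adjoint of w now holds dy/dw = 2 * w + 2 = 8 *)
  Printf.printf "dy/dw = %g\n" (AD.unpack_flt (AD.adjval w))
```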

The rest is easy: `mkpri nn, mkadj nn` collects the weight value and the gradient of each layer into arrays, if it has any.
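
These weight and gradient arrays are then used by the optimiser to update the network parameters. As a hedged illustration (this is not Owl's actual `Optimise` code; `sgd_step` and `lr` are made up for this sketch), a plain SGD step over the `t array array` of weights and gradients could look like this:

```ocaml
open Owl
module AD = Algodiff.S

(* one SGD step: w <- w - lr * g, applied layer by layer, parameter by parameter *)
let sgd_step ~lr (ws : AD.t array array) (gs : AD.t array array) : AD.t array array =
  Array.map2
    (fun wl gl -> Array.map2 (fun w g -> AD.Maths.(w - AD.pack_flt lr * g)) wl gl)
    ws gs
```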