# How I think Algodiff Works

To train a network, we need to first use `train_generic`:

```ocaml
let train_generic ?state ?params ?(init_model=true) nn x y =
  if init_model = true then init nn;
  let f = forward nn in
  let b = backward nn in
  let u = update nn in
  let s = save nn in
  let p = match params with
    | Some p -> p
    | None   -> Optimise.Params.default ()
  in
  Optimise.minimise_network ?state p f b u s x y
```
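
For reference, a typical call looks roughly like the following. This is only a hedged sketch: the network `nn` and the data `x`, `y` are assumed to already exist, the usual Owl aliases (`Graph`, `Params`, `Batch`, `Learning_Rate`) are assumed to be in scope, and the hyper-parameters are purely illustrative.

```ocaml
(* Hedged usage sketch: nn, x and y are assumed to be a built graph network
   and its training data; the hyper-parameters below are only examples. *)
let params = Params.config
    ~batch:(Batch.Mini 100)
    ~learning_rate:(Learning_Rate.Adagrad 0.005)
    10.

let _ = Graph.train_generic ~params nn x y
```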

We can see that it is actually a wrapper around `minimise_network`:

```ocaml
let minimise_network ?state params forward backward update save x y =
  ...
  let iterate i =
    let xt, yt = bach_fun x y i in
    let yt', ws = forward xt in
    let loss = loss_fun yt yt' in
    let loss = Maths.(loss / (_f (Mat.row_num yt |> float_of_int))) in
    let reg = ...
    let loss = Maths.(loss + reg) in
    let ws, gs' = backward loss in
    loss |> primal', ws, gs'
  in

  ...

  while Checkpoint.(state.stop = false) do
    let loss', ws, gs' = iterate Checkpoint.(state.current_batch) in
    ...
  done
```

What are the `forward` and `backward` functions here?

```ocaml
let forward nn x = mktag (tag ()) nn; run x nn, mkpar nn

let backward nn y = reverse_prop (_f 1.) y; mkpri nn, mkadj nn
```

These two operations are called at each iteration. Now we can see the process from a high level.
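
Before looking at each phase in detail, here is a minimal sketch of what one iteration boils down to, assuming a simple L2 loss and a hand-written SGD step in place of whatever `Params` actually configures; `lr` and `aarr_map2` are introduced here only for illustration.

```ocaml
(* A hedged sketch of a single iteration: forward, loss, backward, update.
   The loss and the SGD step are simplified stand-ins for what the Optimise
   module really does. *)
let aarr_map2 f x y = Array.map2 (Array.map2 f) x y

let one_iteration nn xt yt lr =
  (* forward: tag the weights, run the network, collect the tagged weights *)
  let yt', ws = forward nn xt in
  let loss = Maths.(l2norm_sqr' (yt - yt') / _f (Mat.row_num yt |> float_of_int)) in
  (* backward: reverse-propagate from the loss node *)
  let ws, gs' = backward nn loss in
  (* naive gradient descent on every parameter, then write them back *)
  let ws' = aarr_map2 (fun w g -> Maths.(w - _f lr * g)) ws gs' in
  update nn ws';
  primal' loss
```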

## Forward Phase

First, let's distinguish two kinds of neurons: those that contain weights to be updated, such as Conv2D, which I call **Type A** neurons; and those that only perform computation, such as MaxPool2D, which I call **Type B** neurons. Hereafter I'll use "layer" and "neuron" interchangeably.
Each neuron normally contains several *nodes*; each node has type `t`:

```ocaml
type t =
  | F of A.elt                                 (* constructor of float numbers *)
  | Arr of A.arr                               (* constructor of ndarrays *)
  | DF of t * t * int                          (* primal, tangent, tag *)
  | DR of t * t ref * trace_op * int ref * int (* primal, adjoint, op, fanout, tag *)
```
For example, if the `run` function of a neuron is `Maths.(relu (x - _f a))`, it will generate three nodes. I don't think we need the `DF` values here; hereafter I understand the primal value of a `DR` as the value of the operation itself, and its adjoint value as the gradient.

(*wait, what is `primal` anyway? weight? output value? The graph is definitely held in `op`, not here...*)

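As a hedged illustration of the relu example above, assuming `x` is already a `DR` value and `a` is a plain float (the node names `n1`–`n3` are just labels):

```ocaml
(* Roughly what Maths.(relu (x - _f a)) builds when x is a DR value.
   n1, n2, n3 are just labels for the three nodes that get created. *)
let illustrate x a =
  let n1 = _f a in            (* F a: a plain scalar node                      *)
  let n2 = Maths.(x - n1) in  (* DR whose op records the subtraction of x, n1  *)
  let n3 = Maths.relu n2 in   (* DR whose op records the relu applied to n2    *)
  n3
```
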
1. `mktag (tag ()) nn`: for each layer in the neural network, if it is Type A, change each of its parameters into a `DR` value by calling `make_reverse`; if it is Type B, do nothing.

```ocaml
let make_reverse p i = DR (p, ref (zero p), Noop, ref 0, i)
```

Note that the `DR` values created here are ALL `Noop` operations, which means they are nothing more than placeholders at this stage.
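
As a hedged sketch, for a single Conv2D neuron this step amounts to something like the following, assuming the neuron record has mutable fields `w` and `b` as the `run` function quoted below suggests:

```ocaml
(* Roughly what step 1 does for one Type A neuron: its parameters are
   replaced by Noop DR placeholders carrying the current tag i. *)
let mktag_conv2d i l =
  l.w <- make_reverse l.w i;
  l.b <- make_reverse l.b i
```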

2. `run x nn`: connect all the existing operations into a graph by running the network layer by layer, regardless of whether a layer is Type A or B. The whole graph is accumulated onto the output node. Note that the `run` function of each neuron uses operations from `Algodiff.Maths` rather than normal math operations. Let's look at an example:

```ocaml
(* module Conv2D *)
let run x l = Maths.((conv2d ~padding:l.padding x l.w l.stride) + l.b)
```

Here both `l.w` and `l.b` are already set to `DR` placeholders, and `x` is a `t` value output by the previous neuron. How is the `conv2d` operation implemented in Algodiff, then?
```ocaml
and conv2d ?padding a b s =
  let ff a b =
    match a, b with
    | Arr a, Arr b -> Arr A.(conv2d ?padding a b s)
    | _            -> error_binop "conv2d" a b
  in
  let fd a b = conv2d ?padding a b s in
  ...
  let r_d_c a b = Conv2D_D_C (a, b, s) in
  op_d_d_d a b ff fd _ _ _ _ r_d_c _
```
Here `a` and `b` are the input and the kernel respectively, both of type `t`. For simplicity, we only look at the case where the input is a `DR` and the kernel is a constant `Arr`:
(*One question: wait, you just said that `l.w` is already set to `DR` in the previous step, how then could it be `Arr` now? What has changed it?*)

```ocaml
and op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d =
  match a, b with
  | DR (ap, _, _, _, ai), Arr _bp ->
    let cp = fd ap b in
    DR (cp, ref (zero cp), r_d_c a b, ref 0, ai)
  | ...
```

So what `Maths.conv2d` does is this: first calculate the result value from the existing primal value of the `DR` (`let cp = fd ap b`, i.e. `let cp = conv2d ?padding ap b s`), ignore the gradient value (just set it to zero: `ref (zero cp)`), replace the `Noop` operation with `Conv2D_D_C` (`r_d_c a b`), and keep the tag as is (`ai`).

*What I don't quite understand is the function `fd`; it calls itself, why? I temporarily interpret it as "calculating the result value", but that is quite likely wrong.*

The translation of a Type B neuron is similar. For example, for the Maxpool2D neuron:

```ocaml
let run x l = Maths.(max_pool2d l.padding x l.kernel l.stride)

and max_pool2d padding a b s =
  let ff = function
    | Arr a -> Arr A.(max_pool2d ~padding a b s)
    | _     -> error_uniop "max_pool2d" a
  in
  let fd a = max_pool2d padding a b s in
  let df _cp _ap _at = failwith "max_pool2d:df" in
  let r a = Maxpool2D_D (a, padding, b, s) in
  op_d_d a ff fd df r

and op_d_d a ff fd df r =
  match a with
  | DF (ap, at, ai) -> ...
  | DR (ap, _, _, _, ai) -> let cp = fd ap in DR (cp, ref (zero cp), r a, ref 0, ai)
  | ap -> ff ap
```

If the input is a `DR`, then this operation similarly adds a `DR` node to the graph; otherwise `ff` is called and an `Arr` node is added.
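
As a hedged sketch of that last branch (`ap -> ff ap`): if the input is an untagged `Arr`, a Type B operation is just an eager computation and no graph node is recorded. The shapes and pooling parameters below are only illustrative, and `SAME` refers to Owl's padding type.

```ocaml
(* Hedged sketch: feeding a plain Arr through a Type B op just computes the
   value eagerly via ff and returns another plain Arr. *)
let _ =
  let x = Arr (A.uniform [|1; 28; 28; 1|]) in
  Maths.max_pool2d SAME x [|2; 2|] [|2; 2|]
```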

After a forward pass is finished, we get one output `DR` value. But it's much more than an output value; it actually contains a whole computation graph in its `op`:

```ocaml
and trace_op =
  | Conv1D_D_C of t * t * int array
  | Maxpool2D_D of t * padding * int array * int array
  ...
```
The output of each `run` function is accumulated in this way into a graph of `t`s.
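
A hedged sketch of how one could walk this graph through the `op` field, using only the two constructors quoted above plus a catch-all (the real `trace_op` has many more cases):

```ocaml
(* Illustrative only: follow the first t argument of each op to measure how
   deep the recorded graph is. *)
let rec depth = function
  | DR (_, _, op, _, _) ->
    (match op with
     | Conv1D_D_C (a, _, _)     -> 1 + depth a
     | Maxpool2D_D (a, _, _, _) -> 1 + depth a
     | _                        -> 1)
  | _ -> 0
```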

3. The final step, `mkpar nn`, is simple: return the parameters of each layer in an array, which gives a `t array array`.

## Backward Phase

We already have the graph in the output `DR` value `y` from the forward pass; now let's apply the backward step to it. Note that the backward step is actually applied to a `loss` value, which appends some extra nodes to the end of the graph produced by the forward pass, but let's ignore that for now.

We start from `reverse_prop (_f 1.) y`, which simply comprises two steps:

```ocaml
let reverse_prop v x =
  reverse_reset x;
  reverse_push v x

let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t -> (
        match x with
        | DR (_ap, aa, ao, af, _ai) -> (
            aa := reset_zero !aa;
            af := !af + 1;
            if !af = 1 then (
              match ao with
              | Noop -> reset t
              | ....)
            else reset t)
        | _ -> reset t)
  in
  reset [x]

let reverse_push v x =
  let open Maths in
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t -> (
        match x with
        | DR (ap, aa, ao, af, _ai) -> (
            aa := Maths.(!aa + v);
            match ao with
            | Noop -> push t
            | Conv2D_D_C (a, b, s) -> push ((conv2d_backward_input a b s !aa, a) :: t)
            | ....)
        | _ -> push t)
  in
  push [(v, x)]
```

1. No magic in the `reverse_reset` function. Starting from the root node `y`, for each node: 1) set my own gradient to 0; 2) add my parents to the stack; 3) process the first element of the stack, until the stack is empty.

2. `reverse_push` is a little bit more complex but similar (see the toy sketch after this list). Starting from `(v, y)`, where the `v` for `y` is 1, for each node: 1) update my gradient by adding `v` to my current gradient; 2) calculate the `v` for my parents; 3) add `(v, parent)` to the stack; 4) process the first element of the stack, until the stack is empty.
In both steps, if the node is not a `DR`, then just ignore it.

3. After one backward step, the gradient of each node is updated.
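
To make the worklist pattern concrete, here is a self-contained toy version of the same idea, for scalars with only addition and multiplication. This is not Owl's code, just the same push mechanism in miniature:

```ocaml
(* Toy reverse propagation over scalars, mirroring the (v, node) worklist in
   reverse_push above. Not Owl's code; constructor and field names are made up. *)
type node =
  | Const of float
  | Var   of float * float ref * op   (* primal, adjoint, op *)
and op = Noop | Add of node * node | Mul of node * node

let primal = function Const v -> v | Var (v, _, _) -> v

let rec push = function
  | [] -> ()
  | (v, x) :: t ->
    (match x with
     | Var (_, adj, op) ->
       adj := !adj +. v;                 (* accumulate the incoming adjoint *)
       (match op with
        | Noop       -> push t
        | Add (a, b) -> push ((v, a) :: (v, b) :: t)
        | Mul (a, b) -> push ((v *. primal b, a) :: (v *. primal a, b) :: t))
     | Const _ -> push t)

(* y = x1 * x2 + x1, so dy/dx1 = x2 + 1 = 5 and dy/dx2 = x1 = 3 *)
let () =
  let x1 = Var (3., ref 0., Noop) in
  let x2 = Var (4., ref 0., Noop) in
  let m  = Var (primal x1 *. primal x2, ref 0., Mul (x1, x2)) in
  let y  = Var (primal m +. primal x1, ref 0., Add (m, x1)) in
  push [(1., y)];
  match x1, x2 with
  | Var (_, a1, _), Var (_, a2, _) -> Printf.printf "dx1 = %g, dx2 = %g\n" !a1 !a2
  | _ -> ()
```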

The rest is easy: `mkpri nn, mkadj nn` collects the weight values and gradients of each layer into arrays, for every layer that has any.
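
As a final hedged sketch, for a single Conv2D neuron these two calls boil down to something like the following, using `primal` and `adjval` to read a `DR`'s value and accumulated gradient:

```ocaml
(* Roughly what mkpri/mkadj amount to per Type A neuron; Type B neurons
   contribute an empty array. *)
let mkpri_conv2d l = [| primal l.w; primal l.b |]   (* current weight values *)
let mkadj_conv2d l = [| adjval l.w; adjval l.b |]   (* accumulated gradients *)
```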