# How I think Algodiff Works
To train a network, we first call `train_generic`:
```ocaml
let train_generic ?state ?params ?(init_model=true) nn x y =
  if init_model = true then init nn;
  let f = forward nn in
  let b = backward nn in
  let u = update nn in
  let s = save nn in
  let p = match params with
    | Some p -> p
    | None   -> Optimise.Params.default ()
  in
  Optimise.minimise_network ?state p f b u s x y
```
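Before going further, it helps to see how a call reaches `train_generic` in practice. Below is a usage sketch modelled on Owl's MNIST examples; module paths and option names (e.g. `Activation.Softmax`, `Params.config`) vary a bit across Owl versions, so treat it as illustrative:
```ocaml
open Owl
open Neural.S
open Neural.S.Graph

(* a small CNN; Graph.train below ends up in train_generic *)
let network =
  input [|28; 28; 1|]
  |> conv2d [|5; 5; 1; 32|] [|1; 1|] ~act_typ:Activation.Relu
  |> max_pool2d [|2; 2|] [|2; 2|]
  |> fully_connected 1024 ~act_typ:Activation.Relu
  |> linear 10 ~act_typ:Activation.Softmax
  |> get_network

let () =
  let x, _, y = Dataset.load_mnist_train_data_arr () in
  let params = Params.config
    ~batch:(Batch.Mini 100) ~learning_rate:(Learning_Rate.Adagrad 0.005) 1.
  in
  Graph.train ~params network x y |> ignore
```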
`train_generic`, in turn, is a thin wrapper around `minimise_network`:
```ocaml
let minimise_network ?state params forward backward update save x y =
  ...
  let iterate i =
    let xt, yt = bach_fun x y i in
    let yt', ws = forward xt in
    let loss = loss_fun yt yt' in
    let loss = Maths.(loss / (_f (Mat.row_num yt |> float_of_int))) in
    let reg = ... in
    let loss = Maths.(loss + reg) in
    let ws, gs' = backward loss in
    loss |> primal', ws, gs'
  in
  ...
  while Checkpoint.(state.stop = false) do
    let loss', ws, gs' = iterate Checkpoint.(state.current_batch) in
    ...
  done
```
What are the `forward` and `backward` functions here?
```ocaml
let forward nn x = mktag (tag ()) nn; run x nn, mkpar nn
let backward nn y = reverse_prop (_f 1.) y; mkpri nn, mkadj nn
```
These two operations are called at each iteration. Now we can see the process from a high level.
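To make the two phases concrete before dissecting them, here is a minimal sketch that performs the same steps by hand on a single scalar parameter, using Owl's `Algodiff.D` directly (`make_reverse`, `reverse_prop`, `primal'` and `adjval` are the same functions used above):
```ocaml
open Owl
module AD = Algodiff.D

let () =
  (* forward, step 1: wrap the parameter as a DR node (what mktag does) *)
  let w = AD.make_reverse (AD.pack_flt 3.) (AD.tag ()) in
  (* forward, step 2: build the graph by running the computation (run) *)
  let loss = AD.Maths.(sqr (w - AD.pack_flt 1.)) in
  (* backward: seed the output adjoint with 1. and propagate *)
  AD.reverse_prop (AD.pack_flt 1.) loss;
  (* read out the primal and the adjoint (what mkpri/mkadj do) *)
  Printf.printf "loss = %g, dloss/dw = %g\n"
    AD.(unpack_flt (primal' loss))
    AD.(unpack_flt (adjval w))
```
This prints `loss = 4, dloss/dw = 4`, matching d(w-1)²/dw = 2(w-1) at w = 3.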
## Forward Phase
First, let's distinguish two kinds of neurons: those that contain weights to update, such as `Conv2D`, which I call **Type A** neurons; and those that only perform computation, such as `MaxPool2D`, which I call **Type B** neurons. Hereafter I'll use "layer" and "neuron" interchangeably.
Each neuron normally contains several *nodes*, and each node has type `t`:
```ocaml
type t =
  | F   of A.elt                                  (* constructor of float numbers *)
  | Arr of A.arr                                  (* constructor of ndarrays *)
  | DF  of t * t * int                            (* primal, tangent, tag *)
  | DR  of t * t ref * trace_op * int ref * int   (* primal, adjoint, op, fanout, tag *)
```
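A quick toplevel sketch of how these constructors show up in practice (the `DR` comment describes the structure, not the actual printer output):
```ocaml
# open Owl;;
# module AD = Algodiff.D;;
# AD.pack_flt 1.;;                               (* F 1. *)
# AD.pack_arr (Dense.Ndarray.D.zeros [|2; 2|]);; (* Arr <2x2 ndarray> *)
# AD.make_reverse (AD.pack_flt 1.) (AD.tag ());;
(* DR (F 1., ref (F 0.), Noop, ref 0, <tag>) *)
```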
For example, if the `run` function of one neuron is `Maths.(relu (x - _f a))`, it generates 3 nodes. I don't think we need the `DF` constructor here; I read the primal value of a `DR` as the output value of the operation itself, and the adjoint value as the gradient hereafter.
(*Wait, what is `primal` anyway? A weight? An output value? The graph is definitely held in `op`, not here...*)
1. `mktag (tag ()) nn`: for each layer in the neural network, if it is Type A, change each of its parameters into a `DR` value by calling `make_reverse`; if it is a Type B neuron, do nothing.
```ocaml
let make_reverse p i = DR (p, ref (zero p), Noop, ref 0, i)
```
Note that the `DR` values created here are ALL `Noop` operations: at this stage they are nothing more than placeholders.
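For instance, the `Conv2D` neuron's tagging looks roughly like this (a sketch from memory of Owl's neuron modules; the field names match the `run` function quoted below):
```ocaml
(* Type A neuron: re-wrap each parameter as a fresh DR placeholder *)
let mktag t l =
  l.w <- make_reverse l.w t;
  l.b <- make_reverse l.b t

(* Type B neuron, e.g. MaxPool2D: nothing to tag *)
let mktag _t _l = ()
```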
2. `run x nn`: connect all the existing operations into a graph by running the network layer by layer, regardless of whether a layer is Type A or B. The whole graph is accumulated into the output node. Note that the `run` function of each neuron uses operations from `Algodiff.Maths` rather than the normal math functions. Let's look at an example:
```ocaml
(* module Conv2D *)
let run x l = Maths.((conv2d ~padding:l.padding x l.w l.stride) + l.b)
```
Here both `l.w` and `l.b` have already been set to `DR` placeholders, and `x` is a `t` value output by the previous neuron. How is the `conv2d` operation implemented in Algodiff, then?
```ocaml
and conv2d ?padding a b s =
  let ff a b =
    match a, b with
    | Arr a, Arr b -> Arr A.(conv2d ?padding a b s)
    | _            -> error_binop "conv2d" a b
  in
  let fd a b = conv2d ?padding a b s in
  ... (* df_da, df_db, df_dab and the other two r_ constructors elided *)
  let r_d_c a b = Conv2D_D_C (a, b, s) in
  op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d
```
Here `a` and `b` are the input and the kernel respectively, both of type `t`. For simplicity, we only look at the case where the input is a `DR` and the kernel is a constant `Arr`:
(*One question: wait, you just said that `l.w` is already set to `DR` in the previous step; how could it be `Arr` now? What has changed it?*)
```ocaml
and op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d =
  match a, b with
  | DR (ap, _, _, _, ai), Arr _bp ->
    let cp = fd ap b in
    DR (cp, ref (zero cp), r_d_c a b, ref 0, ai)
  | ...
```
So what `Maths.conv2d` does is this: build a new `DR` node whose primal is the result value (`let cp = fd ap b`, i.e. `conv2d ?padding ap b s`), whose adjoint is initialised to zero (`ref (zero cp)`), whose operation is `Conv2D_D_C` instead of `Noop` (`r_d_c a b`), and whose tag is kept as is (`ai`).
*What I don't quite understand is the function `fd`; it calls itself. Why? I temporarily interpret it as "calculating the result value", but that is quite likely wrong.*
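One plausible reading (my sketch, not checked against every case in the source): `fd` re-invokes `conv2d` on the *primal* `ap`, and since `ap` is a plain `Arr` here, the recursive call falls through `op_d_d_d` into the `ff` base case:
```ocaml
(* Trace, for x : DR with primal xp : Arr and kernel k : Arr:

     conv2d x k s
       -> op_d_d_d sees (DR, Arr); the new primal is fd xp k
       -> fd xp k = conv2d ?padding xp k s        (the recursive call)
       -> op_d_d_d now sees (Arr, Arr) and calls ff, the base case
       -> ff xp k = Arr A.(conv2d ?padding xp k s)

   So fd does "calculate the result value": it recurses until both
   arguments are plain values, at which point ff runs the raw
   numerical convolution. *)
```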
The translation of a Type B neuron is similar. For example, the `MaxPool2D` neuron:
```ocaml
(* module MaxPool2D *)
let run x l = Maths.(max_pool2d l.padding x l.kernel l.stride)

(* module Maths *)
and max_pool2d padding a b s =
  let ff = function
    | Arr a -> Arr A.(max_pool2d ~padding a b s)
    | _     -> error_uniop "max_pool2d" a
  in
  let fd a = max_pool2d padding a b s in
  let df _cp _ap _at = failwith "max_pool2d:df" in
  let r a = Maxpool2D_D (a, padding, b, s) in
  op_d_d a ff fd df r

and op_d_d a ff fd df r =
  match a with
  | DF (ap, at, ai)      -> ...
  | DR (ap, _, _, _, ai) -> let cp = fd ap in DR (cp, ref (zero cp), r a, ref 0, ai)
  | ap                   -> ff ap
```
If the input is a `DR`, this operation similarly appends a `DR` node to the graph; otherwise `ff` is called and a plain `Arr` value is returned.
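The `df` function is only used for forward-mode (`DF`) values, which is why `max_pool2d` can afford to `failwith` there. For comparison, here is a unary operation with a real tangent rule, modelled on Owl's `sin` (details may differ slightly between versions):
```ocaml
and sin a =
  let ff = function
    | F a   -> F A.Scalar.(sin a)
    | Arr a -> Arr A.(sin a)
    | _     -> error_uniop "sin" a
  in
  let fd a = sin a in
  (* tangent rule for forward mode: d(sin a) = cos a * da *)
  let df _cp ap at = Maths.(at * cos ap) in
  let r a = Sin_D a in
  op_d_d a ff fd df r
```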
After a forward pass finishes, we get one output `DR` value. But it is much more than an output value; it carries the whole computation graph in its `op` field:
```ocaml
and trace_op =
  | Conv1D_D_C  of t * t * int array
  | Maxpool2D_D of t * padding * int array * int array
  | ...
```
The output of each `run` function is accumulated in this way into a graph of `t`s.
3. The final step, `mkpar nn`, is simple: return the parameters of each layer in an array, of type `t array array`.
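Per neuron, these collectors are one-liners. A sketch modelled on Owl's `Conv2D` neuron (Type B neurons just return `[||]`):
```ocaml
(* Type A neuron: expose the parameters, their primals and their adjoints *)
let mkpar l = [|l.w; l.b|]
let mkpri l = [|primal l.w; primal l.b|]
let mkadj l = [|adjval l.w; adjval l.b|]

(* Type B neuron, e.g. MaxPool2D: no parameters *)
let mkpar _l = [||]
```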
## Backward Phase
We have already got the graph in the output `DR` value `y` from the forward pass; now let's apply the backward step to it. Note that the backward step is actually applied to a `loss` value, which appends some extra nodes (the loss and regularisation terms from `iterate`) to the end of the forward pass's output graph, but let's ignore that for now.
Start from `reverse_prop (_f 1.) y`, which simply comprises two steps:
```ocaml
let reverse_prop v x =
  reverse_reset x;
  reverse_push v x

let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t -> (
      match x with
      | DR (_ap, aa, ao, af, _ai) -> (
        aa := reset_zero !aa;
        af := !af + 1;
        if !af = 1 then (
          match ao with
          | Noop -> reset t
          | ...)
        else reset t)
      | _ -> reset t)
  in
  reset [x]

let reverse_push v x =
  let open Maths in
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t -> (
      match x with
      | DR (ap, aa, ao, af, _ai) -> (
        aa := Maths.(!aa + v);
        match ao with
        | Noop -> push t
        | Conv2D_D_C (a, b, s) -> push ((conv2d_backward_input a b s !aa, a) :: t)
        | ...)
      | _ -> push t)
  in
  push [(v, x)]
```
1. No magic in the `reverse_reset` function. Starting from the root node `y`, for each node: 1) set its adjoint to 0; 2) push its parents onto the stack; 3) process the head of the stack until the stack is empty.
2. `reverse_push` is a little more complex, but similar. Starting from `(v, y)`, where `v` is 1 for `y`, for each node: 1) update its adjoint by adding `v` to the current adjoint; 2) compute the `v` for each of its parents; 3) push `(v, parent)` onto the stack; 4) process the head of the stack until the stack is empty.
In both steps, nodes that are not `DR` values are simply skipped.
3. After one backward pass, the adjoint (gradient) of every node has been updated.
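Here is a hand trace of `reverse_push` on the toy example from earlier, `loss = Maths.(sqr (w - _f 1.))` with `w`'s primal being `3.` (the `trace_op` constructor names are my best guess at the corresponding cases):
```ocaml
(* push [(F 1., loss)]
     loss : DR (F 4., aa, Sqr_D s, _, _)
            aa := 0 + 1;   v for s is 2 * s * 1 = 4   -> push [(F 4., s)]
     s    : DR (F 2., aa, Sub_D_C (w, F 1.), _, _)
            aa := 0 + 4;   v for w is 4 * 1 = 4       -> push [(F 4., w)]
     w    : DR (F 3., aa, Noop, _, _)
            aa := 0 + 4;   Noop                       -> push []
   The adjoint of w ends up as F 4. = d(w - 1)^2 / dw at w = 3. *)
```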
The rest is easy. `mkpri nn, mkadj nn` collects the primal value and the adjoint of each layer's parameters into arrays, for layers that have any.