# How I think Algodiff Works

To train a network, we need to first use `train_generic`:

```ocaml
let train_generic ?state ?params ?(init_model=true) nn x y =
  if init_model = true then init nn;
  let f = forward nn in
  let b = backward nn in
  let u = update nn in
  let s = save nn in
  let p = match params with
    | Some p -> p
    | None   -> Optimise.Params.default ()
  in
  Optimise.minimise_network ?state p f b u s x y
```
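
For reference, a typical call looks roughly like the following. This is only a hedged sketch: the network `nn` and the data `x`, `y` are assumed to already exist, the usual Owl aliases (`Graph`, `Params`, `Batch`, `Learning_Rate`) are assumed to be in scope, and the hyper-parameters are purely illustrative.

```ocaml
(* Hedged usage sketch: nn, x and y are assumed to be a built graph network
   and its training data; the hyper-parameters below are only examples. *)
let params = Params.config
    ~batch:(Batch.Mini 100)
    ~learning_rate:(Learning_Rate.Adagrad 0.005)
    10.

let _ = Graph.train_generic ~params nn x y
```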

We can see that it is actually a wrapper around `minimise_network`:

```ocaml
let minimise_network ?state params forward backward update save x y =
  ...
  let iterate i =
    let xt, yt = bach_fun x y i in
    let yt', ws = forward xt in
    let loss = loss_fun yt yt' in
    let loss = Maths.(loss / (_f (Mat.row_num yt |> float_of_int))) in
    let reg = ...
    let loss = Maths.(loss + reg) in
    let ws, gs' = backward loss in
    loss |> primal', ws, gs'
  in

  ...

  while Checkpoint.(state.stop = false) do
    let loss', ws, gs' = iterate Checkpoint.(state.current_batch) in
    ...
  done
```

What are the `forward` and `backward` functions here?

```ocaml
let forward nn x = mktag (tag ()) nn; run x nn, mkpar nn

let backward nn y = reverse_prop (_f 1.) y; mkpri nn, mkadj nn
```

These two operations are called at each iteration. Now we can see the process from a high level.
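
Before looking at each phase in detail, here is a minimal sketch of what one iteration boils down to, assuming a simple L2 loss and a hand-written SGD step in place of whatever `Params` actually configures; `lr` and `aarr_map2` are introduced here only for illustration.

```ocaml
(* A hedged sketch of a single iteration: forward, loss, backward, update.
   The loss and the SGD step are simplified stand-ins for what the Optimise
   module really does. *)
let aarr_map2 f x y = Array.map2 (Array.map2 f) x y

let one_iteration nn xt yt lr =
  (* forward: tag the weights, run the network, collect the tagged weights *)
  let yt', ws = forward nn xt in
  let loss = Maths.(l2norm_sqr' (yt - yt') / _f (Mat.row_num yt |> float_of_int)) in
  (* backward: reverse-propagate from the loss node *)
  let ws, gs' = backward nn loss in
  (* naive gradient descent on every parameter, then write them back *)
  let ws' = aarr_map2 (fun w g -> Maths.(w - _f lr * g)) ws gs' in
  update nn ws';
  primal' loss
```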

## Forward Phase

First, let's distinguish two kinds of neurons: those that contain weights to be updated, such as Conv2D, which I call **Type A** neurons; and those that only perform computation, such as MaxPool2D, which I call **Type B** neurons. Hereafter I'll use "layer" and "neuron" interchangeably.
Each neuron normally contains several *nodes*; each node has type `t`:

```ocaml
type t =
  | F of A.elt                                 (* constructor of float numbers *)
  | Arr of A.arr                               (* constructor of ndarrays *)
  | DF of t * t * int                          (* primal, tangent, tag *)
  | DR of t * t ref * trace_op * int ref * int (* primal, adjoint, op, fanout, tag *)
```
For example, if the `run` function of a neuron is `Maths.(relu (x - _f a))`, it will generate three nodes. I don't think we need the `DF` values here; hereafter I understand the primal value of a `DR` as the value of the operation itself, and its adjoint value as the gradient.

(*wait, what is `primal` anyway? weight? output value? The graph is definitely held in `op`, not here...*)

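As a hedged illustration of the relu example above, assuming `x` is already a `DR` value and `a` is a plain float (the node names `n1`–`n3` are just labels):

```ocaml
(* Roughly what Maths.(relu (x - _f a)) builds when x is a DR value.
   n1, n2, n3 are just labels for the three nodes that get created. *)
let illustrate x a =
  let n1 = _f a in            (* F a: a plain scalar node                      *)
  let n2 = Maths.(x - n1) in  (* DR whose op records the subtraction of x, n1  *)
  let n3 = Maths.relu n2 in   (* DR whose op records the relu applied to n2    *)
  n3
```
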
1. `mktag (tag ()) nn`: for each layer in the neural network, if it is Type A, change each of its parameters into a `DR` value by calling `make_reverse`; if it is Type B, do nothing.

```ocaml
let make_reverse p i = DR (p, ref (zero p), Noop, ref 0, i)
```

Note that the `DR` values created here are ALL `Noop` operations, which means they are nothing more than placeholders at this stage.
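
As a hedged sketch, for a single Conv2D neuron this step amounts to something like the following, assuming the neuron record has mutable fields `w` and `b` as the `run` function quoted below suggests:

```ocaml
(* Roughly what step 1 does for one Type A neuron: its parameters are
   replaced by Noop DR placeholders carrying the current tag i. *)
let mktag_conv2d i l =
  l.w <- make_reverse l.w i;
  l.b <- make_reverse l.b i
```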

2. `run x nn`: connect all the existing operations into a graph by running the network layer by layer, regardless of whether a layer is Type A or B. The whole graph is accumulated onto the output node. Note that the `run` function of each neuron uses operations from `Algodiff.Maths` rather than normal math operations. Let's look at an example:

```ocaml
(* module Conv2D *)
let run x l = Maths.((conv2d ~padding:l.padding x l.w l.stride) + l.b)
```

Here both `l.w` and `l.b` are already set to `DR` placeholders, and `x` is a `t` value output by the previous neuron. How is the `conv2d` operation implemented in Algodiff, then?
```ocaml
and conv2d ?padding a b s =
  let ff a b =
    match a, b with
    | Arr a, Arr b -> Arr A.(conv2d ?padding a b s)
    | _            -> error_binop "conv2d" a b
  in
  let fd a b = conv2d ?padding a b s in
  ...
  let r_d_c a b = Conv2D_D_C (a, b, s) in
  op_d_d_d a b ff fd _ _ _ _ r_d_c _
```
Here `a` and `b` are the input and the kernel respectively, both of type `t`. For simplicity, we only look at the case where the input is a `DR` and the kernel is a constant `Arr`:
(*One question: wait, you just said that `l.w` is already set to `DR` in the previous step, how then could it be `Arr` now? What has changed it?*)

```ocaml
and op_d_d_d a b ff fd df_da df_db df_dab r_d_d r_d_c r_c_d =
  match a, b with
  | DR (ap, _, _, _, ai), Arr _bp ->
    let cp = fd ap b in
    DR (cp, ref (zero cp), r_d_c a b, ref 0, ai)
  | ...
```

So what `Maths.conv2d` does is this: first calculate the result value from the existing primal value of the `DR` (`let cp = fd ap b`, i.e. `let cp = conv2d ?padding ap b s`), ignore the gradient value (just set it to zero: `ref (zero cp)`), replace the `Noop` operation with `Conv2D_D_C` (`r_d_c a b`), and keep the tag as is (`ai`).

*What I don't quite understand is the function `fd`; it calls itself, why? I temporarily interpret it as "calculating the result value", but that is quite likely wrong.*

The translation of a Type B neuron is similar. For example, for the Maxpool2D neuron:

```ocaml
let run x l = Maths.(max_pool2d l.padding x l.kernel l.stride)

and max_pool2d padding a b s =
  let ff = function
    | Arr a -> Arr A.(max_pool2d ~padding a b s)
    | _     -> error_uniop "max_pool2d" a
  in
  let fd a = max_pool2d padding a b s in
  let df _cp _ap _at = failwith "max_pool2d:df" in
  let r a = Maxpool2D_D (a, padding, b, s) in
  op_d_d a ff fd df r

and op_d_d a ff fd df r =
  match a with
  | DF (ap, at, ai) -> ...
  | DR (ap, _, _, _, ai) -> let cp = fd ap in DR (cp, ref (zero cp), r a, ref 0, ai)
  | ap -> ff ap
```

If the input is a `DR`, then this operation similarly adds a `DR` node to the graph; otherwise `ff` is called and an `Arr` node is added.
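
As a hedged sketch of that last branch (`ap -> ff ap`): if the input is an untagged `Arr`, a Type B operation is just an eager computation and no graph node is recorded. The shapes and pooling parameters below are only illustrative, and `SAME` refers to Owl's padding type.

```ocaml
(* Hedged sketch: feeding a plain Arr through a Type B op just computes the
   value eagerly via ff and returns another plain Arr. *)
let _ =
  let x = Arr (A.uniform [|1; 28; 28; 1|]) in
  Maths.max_pool2d SAME x [|2; 2|] [|2; 2|]
```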

After a forward pass is finished, we get one output `DR` value. But it's much more than an output value; it actually contains a whole computation graph in its `op`:

```ocaml
and trace_op =
  | Conv1D_D_C of t * t * int array
  | Maxpool2D_D of t * padding * int array * int array
  ...
```
The output of each `run` function is accumulated in this way into a graph of `t`s.
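
A hedged sketch of how one could walk this graph through the `op` field, using only the two constructors quoted above plus a catch-all (the real `trace_op` has many more cases):

```ocaml
(* Illustrative only: follow the first t argument of each op to measure how
   deep the recorded graph is. *)
let rec depth = function
  | DR (_, _, op, _, _) ->
    (match op with
     | Conv1D_D_C (a, _, _)     -> 1 + depth a
     | Maxpool2D_D (a, _, _, _) -> 1 + depth a
     | _                        -> 1)
  | _ -> 0
```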

3. The final step, `mkpar nn`, is simple: return the parameters of each layer in an array, which gives a `t array array`.

## Backward Phase

We already have the graph in the output `DR` value `y` from the forward pass; now let's apply the backward step to it. Note that the backward step is actually applied to a `loss` value, which appends some extra nodes to the end of the graph produced by the forward pass, but let's ignore that for now.

We start from `reverse_prop (_f 1.) y`, which simply comprises two steps:

```ocaml
let reverse_prop v x =
  reverse_reset x;
  reverse_push v x

let reverse_reset x =
  let rec reset xs =
    match xs with
    | [] -> ()
    | x :: t -> (
        match x with
        | DR (_ap, aa, ao, af, _ai) -> (
            aa := reset_zero !aa;
            af := !af + 1;
            if !af = 1 then (
              match ao with
              | Noop -> reset t
              | ....)
            else reset t)
        | _ -> reset t)
  in
  reset [x]

let reverse_push v x =
  let open Maths in
  let rec push xs =
    match xs with
    | [] -> ()
    | (v, x) :: t -> (
        match x with
        | DR (ap, aa, ao, af, _ai) -> (
            aa := Maths.(!aa + v);
            match ao with
            | Noop -> push t
            | Conv2D_D_C (a, b, s) -> push ((conv2d_backward_input a b s !aa, a) :: t)
            | ....)
        | _ -> push t)
  in
  push [(v, x)]
```

1. No magic in the `reverse_reset` function. Starting from the root node `y`, for each node: 1) set my own gradient to 0; 2) add my parents to the stack; 3) process the first element of the stack, until the stack is empty.

2. `reverse_push` is a little bit more complex but similar (see the toy sketch after this list). Starting from `(v, y)`, where the `v` for `y` is 1, for each node: 1) update my gradient by adding `v` to my current gradient; 2) calculate the `v` for my parents; 3) add `(v, parent)` to the stack; 4) process the first element of the stack, until the stack is empty.
In both steps, if the node is not a `DR`, then just ignore it.

3. After one backward step, the gradient of each node is updated.
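
To make the worklist pattern concrete, here is a self-contained toy version of the same idea, for scalars with only addition and multiplication. This is not Owl's code, just the same push mechanism in miniature:

```ocaml
(* Toy reverse propagation over scalars, mirroring the (v, node) worklist in
   reverse_push above. Not Owl's code; constructor and field names are made up. *)
type node =
  | Const of float
  | Var   of float * float ref * op   (* primal, adjoint, op *)
and op = Noop | Add of node * node | Mul of node * node

let primal = function Const v -> v | Var (v, _, _) -> v

let rec push = function
  | [] -> ()
  | (v, x) :: t ->
    (match x with
     | Var (_, adj, op) ->
       adj := !adj +. v;                 (* accumulate the incoming adjoint *)
       (match op with
        | Noop       -> push t
        | Add (a, b) -> push ((v, a) :: (v, b) :: t)
        | Mul (a, b) -> push ((v *. primal b, a) :: (v *. primal a, b) :: t))
     | Const _ -> push t)

(* y = x1 * x2 + x1, so dy/dx1 = x2 + 1 = 5 and dy/dx2 = x1 = 3 *)
let () =
  let x1 = Var (3., ref 0., Noop) in
  let x2 = Var (4., ref 0., Noop) in
  let m  = Var (primal x1 *. primal x2, ref 0., Mul (x1, x2)) in
  let y  = Var (primal m +. primal x1, ref 0., Add (m, x1)) in
  push [(1., y)];
  match x1, x2 with
  | Var (_, a1, _), Var (_, a2, _) -> Printf.printf "dx1 = %g, dx2 = %g\n" !a1 !a2
  | _ -> ()
```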

The rest is easy: `mkpri nn, mkadj nn` collects the weight values and gradients of each layer into arrays, for every layer that has any.
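
As a final hedged sketch, for a single Conv2D neuron these two calls boil down to something like the following, using `primal` and `adjval` to read a `DR`'s value and accumulated gradient:

```ocaml
(* Roughly what mkpri/mkadj amount to per Type A neuron; Type B neurons
   contribute an empty array. *)
let mkpri_conv2d l = [| primal l.w; primal l.b |]   (* current weight values *)
let mkadj_conv2d l = [| adjval l.w; adjval l.b |]   (* accumulated gradients *)
```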