Capacitron Intuition

For a fixed set of model parameters \theta and D_KL > C:

- D_KL > C implies that the constraint term on the right of eq. 9, \beta * (D_KL - C), is positive, so the maximiser w.r.t. \beta is pushing beta (and only beta) up.

- This high beta then gets fixed, so the other optimiser, the one updating the whole model, now sees a large positive number on the RHS.

- Now, with this \beta fixed, the minimising optimisation needs to drive D_KL down in order to make that term on the RHS smaller (see the sketch right after this list).

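To make the two updates concrete, here is a minimal PyTorch-style sketch of one training step. This is only an illustration of the mechanism, not the Capacitron authors' code: ToyModel, train_step, the learning rates and the softplus parameterisation of beta are all placeholders I am assuming here. The model optimiser does gradient descent on the whole objective, while beta's optimiser does gradient ascent on the same objective, so beta only ever reacts to the sign of (D_KL - C).

import torch
import torch.nn.functional as F

class ToyModel(torch.nn.Module):
    # Stand-in for the real TTS model: encode x to a diagonal Gaussian q(z|x),
    # sample z, decode, and return (l1 reconstruction loss, KL to a unit Gaussian).
    def __init__(self, dim=80, zdim=16):
        super().__init__()
        self.enc = torch.nn.Linear(dim, 2 * zdim)
        self.dec = torch.nn.Linear(zdim, dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterised sample z ~ q(z|x)
        recon = self.dec(z)
        l1 = (recon - x).abs().mean()                           # decoder (l1) loss
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()  # D_KL(q(z|x) || p(z))
        return l1, kl

model = ToyModel()
beta_raw = torch.nn.Parameter(torch.tensor(0.0))           # unconstrained dual variable
opt_theta = torch.optim.Adam(model.parameters(), lr=1e-3)  # minimises the objective
opt_beta = torch.optim.SGD([beta_raw], lr=1e-2)            # used for ascent, see below

def train_step(batch, C):
    recon_loss, kl = model(batch)
    beta = F.softplus(beta_raw)          # keeps beta positive
    loss = recon_loss + beta * (kl - C)  # eq. 9-style objective; the RHS term is beta * (kl - C)

    opt_theta.zero_grad()
    opt_beta.zero_grad()
    loss.backward()

    opt_theta.step()                     # theta: gradient DESCENT on the whole loss

    # beta: gradient ASCENT on the same loss. d(loss)/d(beta) has the sign of
    # (kl - C), so beta is pushed up while D_KL > C and pushed down while D_KL < C.
    beta_raw.grad.neg_()
    opt_beta.step()
    return recon_loss.item(), kl.item(), beta.item()

# Example: one step on a random batch of 4 fake 80-bin frames with capacity C = 10 nats:
# print(train_step(torch.randn(4, 80), C=10.0))

Because both optimisers look at the same scalar with opposite signs, beta effectively just integrates the sign of (D_KL - C) over training.
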
For a fixed set of model parameters \theta and D_KL < C:

- D_KL < C implies that the \beta * (D_KL - C) term on the right is negative, so the maximiser w.r.t. \beta is pushing beta down.

- This small beta gets fixed, so the Adam optimiser sees a small negative number on the RHS.

- In order to make this even smaller, the optimiser drives the whole equation down; however, it sees the whole equation as one objective, including the decoder term!

- By minimising the decoder loss, the KL term increases. Here's why:

- The decoder loss is an l1 loss between the synthesised spectrogram and the input spectrogram. The synthesised spectrogram is conditioned on the output of the reference encoder, which is actually a sample from the distribution z ~ q(z|x). The more you train the model parameters \theta, the more useful these samples become, because the objective encourages the encoder network to pack useful information into the vector sampled from q(z|x). More informative samples from q(z|x) mean higher mutual information I(X, Z) between the input data and the approximate latent distribution. Eq. 7 shows that R^AVG (which is just the KL term in eq. 9 averaged over the data) is actually an UPPER BOUND on the mutual information I(X, Z); it is an upper bound because of eq. 5-8 and appendix C1 (see the decomposition right after this list).

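For reference, here is the standard decomposition behind that upper bound, written in this note's notation rather than quoted from the paper's eq. 5-8; q(z) below denotes the aggregate posterior E_x[ q(z|x) ]:

R^AVG = E_x[ D_KL( q(z|x) || p(z) ) ]
      = E_x E_{q(z|x)}[ log q(z|x) - log q(z) ] + E_x E_{q(z|x)}[ log q(z) - log p(z) ]
      = I(X, Z) + D_KL( q(z) || p(z) )
      >= I(X, Z)

The second term is itself a KL divergence and therefore non-negative, so the averaged KL can never sit below the mutual information; if I(X, Z) grows, R^AVG (and with it the per-example D_KL in eq. 9) has to grow with it.
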
What this means, basically, is that:

the better the model gets (i.e. the smaller the decoder loss) ==> the higher the mutual information I(X, Z) between the input data and the approximate latent distribution ==> the upper bound R^AVG cannot stay below it, so it gets pushed up ==> the D_KL term in eq. 9 increases.

A negative RHS in eq. 9 therefore just encourages the model to learn more about the input data, and this in turn increases the KL term on the RHS, up until the capacity C, at which point this positive-negative push-and-pull makes the KL term oscillate around the capacity limit (a toy numerical illustration of the loop follows).

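A deliberately crude two-variable caricature of that feedback loop, to show the oscillation; this is not the paper's model or its actual training dynamics, and the update rules and constants below are made up purely for illustration:

# In this toy, D_KL rises while beta is below 1 (weak penalty) and falls while
# beta is above 1 (strong penalty); beta itself does gradient ascent on the
# beta * (D_KL - C) term.
C = 50.0
d_kl, beta = 46.0, 1.0
lr_model, lr_beta = 0.5, 0.02

for step in range(150):
    d_kl += lr_model * (1.0 - beta)   # stand-in for the theta update's net effect on D_KL
    beta += lr_beta * (d_kl - C)      # ascent on the RHS of eq. 9 w.r.t. beta
    beta = max(beta, 0.0)             # beta is kept non-negative (softplus in practice)
    if step % 10 == 0:
        print(f"step {step:3d}   D_KL ~ {d_kl:5.2f}   beta ~ {beta:4.2f}")

# D_KL overshoots C, beta rises and pushes it back down, beta then relaxes and
# D_KL creeps up again: the KL term ends up circling the capacity limit instead
# of collapsing to zero or blowing up.
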
If D_KL is increasing because the model is learning more and better, it will eventually increase to the point where D_KL > C again, and that restarts the whole process described at the top.