Capacitron Intuition

For a fixed set of model parameters \theta and D_KL > C:

- D_KL > C implies that the constraint term on the right of eq. 9, \beta * (D_KL - C), is positive, so the maximiser w.r.t. \beta is pushing beta (and only beta) up.

- This high beta then gets fixed, so the other optimiser, the one updating the whole model, now sees a large positive number on the RHS.

- Now, with this \beta fixed, the minimising optimisation needs to drive D_KL down in order to make that term on the RHS smaller (see the sketch right after this list).

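To make the two updates concrete, here is a minimal PyTorch-style sketch of one training step. This is only an illustration of the mechanism, not the Capacitron authors' code: ToyModel, train_step, the learning rates and the softplus parameterisation of beta are all placeholders I am assuming here. The model optimiser does gradient descent on the whole objective, while beta's optimiser does gradient ascent on the same objective, so beta only ever reacts to the sign of (D_KL - C).

import torch
import torch.nn.functional as F

class ToyModel(torch.nn.Module):
    # Stand-in for the real TTS model: encode x to a diagonal Gaussian q(z|x),
    # sample z, decode, and return (l1 reconstruction loss, KL to a unit Gaussian).
    def __init__(self, dim=80, zdim=16):
        super().__init__()
        self.enc = torch.nn.Linear(dim, 2 * zdim)
        self.dec = torch.nn.Linear(zdim, dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterised sample z ~ q(z|x)
        recon = self.dec(z)
        l1 = (recon - x).abs().mean()                           # decoder (l1) loss
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()  # D_KL(q(z|x) || p(z))
        return l1, kl

model = ToyModel()
beta_raw = torch.nn.Parameter(torch.tensor(0.0))           # unconstrained dual variable
opt_theta = torch.optim.Adam(model.parameters(), lr=1e-3)  # minimises the objective
opt_beta = torch.optim.SGD([beta_raw], lr=1e-2)            # used for ascent, see below

def train_step(batch, C):
    recon_loss, kl = model(batch)
    beta = F.softplus(beta_raw)          # keeps beta positive
    loss = recon_loss + beta * (kl - C)  # eq. 9-style objective; the RHS term is beta * (kl - C)

    opt_theta.zero_grad()
    opt_beta.zero_grad()
    loss.backward()

    opt_theta.step()                     # theta: gradient DESCENT on the whole loss

    # beta: gradient ASCENT on the same loss. d(loss)/d(beta) has the sign of
    # (kl - C), so beta is pushed up while D_KL > C and pushed down while D_KL < C.
    beta_raw.grad.neg_()
    opt_beta.step()
    return recon_loss.item(), kl.item(), beta.item()

# Example: one step on a random batch of 4 fake 80-bin frames with capacity C = 10 nats:
# print(train_step(torch.randn(4, 80), C=10.0))

Because both optimisers look at the same scalar with opposite signs, beta effectively just integrates the sign of (D_KL - C) over training.
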
For a fixed set of model parameters \theta and D_KL < C:

- D_KL < C implies that the \beta * (D_KL - C) term on the right is negative, so the maximiser w.r.t. \beta is pushing beta down.

- This small beta gets fixed, so the Adam optimiser sees a small negative number on the RHS.

- In order to make this even smaller, the optimiser drives the whole equation down; however, it sees the whole equation as one objective, including the decoder term!

- By minimising the decoder loss, the KL term increases. Here's why:

- The decoder loss is an l1 loss between the synthesised spectrogram and the input spectrogram. The synthesised spectrogram is conditioned on the output of the reference encoder, which is actually a sample from the distribution z ~ q(z|x). The more you train the model parameters \theta, the more useful these samples become, because the objective encourages the encoder network to pack useful information into the vector sampled from q(z|x). More informative samples from q(z|x) mean higher mutual information I(X, Z) between the input data and the approximate latent distribution. Eq. 7 shows that R^AVG (which is just the KL term in eq. 9 averaged over the data) is actually an UPPER BOUND on the mutual information I(X, Z); it is an upper bound because of eq. 5-8 and appendix C1 (see the decomposition right after this list).

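For reference, here is the standard decomposition behind that upper bound, written in this note's notation rather than quoted from the paper's eq. 5-8; q(z) below denotes the aggregate posterior E_x[ q(z|x) ]:

R^AVG = E_x[ D_KL( q(z|x) || p(z) ) ]
      = E_x E_{q(z|x)}[ log q(z|x) - log q(z) ] + E_x E_{q(z|x)}[ log q(z) - log p(z) ]
      = I(X, Z) + D_KL( q(z) || p(z) )
      >= I(X, Z)

The second term is itself a KL divergence and therefore non-negative, so the averaged KL can never sit below the mutual information; if I(X, Z) grows, R^AVG (and with it the per-example D_KL in eq. 9) has to grow with it.
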
What this means, basically, is that:

the better the model gets (i.e. the smaller the decoder loss) ==> the higher the mutual information I(X, Z) between the input data and the approximate latent distribution ==> the upper bound R^AVG cannot stay below it, so it gets pushed up ==> the D_KL term in eq. 9 increases.

A negative RHS in eq. 9 therefore just encourages the model to learn more about the input data, and this in turn increases the KL term on the RHS, up until the capacity C, at which point this positive-negative push-and-pull makes the KL term oscillate around the capacity limit (a toy numerical illustration of the loop follows).

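A deliberately crude two-variable caricature of that feedback loop, to show the oscillation; this is not the paper's model or its actual training dynamics, and the update rules and constants below are made up purely for illustration:

# In this toy, D_KL rises while beta is below 1 (weak penalty) and falls while
# beta is above 1 (strong penalty); beta itself does gradient ascent on the
# beta * (D_KL - C) term.
C = 50.0
d_kl, beta = 46.0, 1.0
lr_model, lr_beta = 0.5, 0.02

for step in range(150):
    d_kl += lr_model * (1.0 - beta)   # stand-in for the theta update's net effect on D_KL
    beta += lr_beta * (d_kl - C)      # ascent on the RHS of eq. 9 w.r.t. beta
    beta = max(beta, 0.0)             # beta is kept non-negative (softplus in practice)
    if step % 10 == 0:
        print(f"step {step:3d}   D_KL ~ {d_kl:5.2f}   beta ~ {beta:4.2f}")

# D_KL overshoots C, beta rises and pushes it back down, beta then relaxes and
# D_KL creeps up again: the KL term ends up circling the capacity limit instead
# of collapsing to zero or blowing up.
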
If D_KL is increasing because the model is learning more and better, it will eventually increase to the point where D_KL > C again, and that restarts the whole process described at the top.