- For a fixed set of model parameters \theta and D_KL > C:
- - D_KL > C means the term on the right, \beta (D_KL - C), is positive, so the maximiser w.r.t. \beta pushes \beta (and only \beta) up.
- - With this high \beta now held fixed, the optimiser for the whole model sees a large positive term on the RHS.
- - To minimise the objective with \beta fixed, this optimiser has to drive D_KL down, which shrinks that RHS term. (A sketch of the objective is written out after this list.)
- For a fixed set of model parameters \theta and D_KL < C:
- - D_KL < C means the term on the right is negative, so the maximiser w.r.t. \beta pushes \beta down.
- - With this small \beta fixed, the Adam optimiser (over \theta) sees a small negative number on the RHS.
- - To make the objective even smaller, the optimiser drives down the whole equation; however, it sees the whole equation as one, including the decoder term!
- - By minimising the decoder loss, the KL term increases; here's why:
- - The decoder loss is an l1 loss between the synthesised spectrogram and the input spectrogram. The synthesised spectrogram is conditioned on the output of the reference encoder, which is actually a sample from the approximate posterior (z ~ q(z|x)).
- - The more you train the model parameters \theta, the more useful the information the encoder is encouraged to pack into this sampled vector from q(z|x). More informative samples from q(z|x) mean higher mutual information I(X, Z) between the input data and the latent.
- - Eq. 7 shows that R^AVG (which is just the KL term in eq. 9, averaged over the data) is actually an UPPER BOUND on the mutual information I(X, Z); this follows from eqs. 5-8 and appendix C1. (The bound is written out after this list.)
- What this basically means is that:
- - the better the model gets (i.e. the smaller the decoder loss) ==> the higher the mutual information between the data and the latent ==> the upper bound R^AVG on the mutual information has to rise with it ==> the D_KL term in eq. 9 increases.
- A negative RHS in eq. 9 therefore just encourages the model to learn more about the input data, and this in turn increases the KL term on the RHS - up until the capacity C, after which this push-and-pull makes the KL term oscillate around the capacity limit (a toy training loop illustrating this is sketched after this list).
- If D_KL is increasing because the model keeps learning, it will eventually exceed C again (D_KL > C), and that restarts the whole process described above.
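For reference, here is the form I assume eq. 9 takes (a sketch in my own notation, not copied from the paper): the decoder loss plus a Lagrangian penalty on the KL term with capacity C, minimised over \theta and maximised over \beta >= 0:

    min_\theta max_{\beta >= 0}  L(\theta, \beta) = L_dec(\theta) + \beta * ( D_KL( q(z|x) || p(z) ) - C )

The gradient of this w.r.t. \beta is just (D_KL - C), which is exactly why the maximiser pushes \beta up when D_KL > C and down when D_KL < C.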
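The upper-bound relation from eq. 7, again sketched in my own notation (assuming R^AVG is the KL term averaged over the data, as described above):

    I(X, Z) <= E_{x ~ p(x)} [ D_KL( q(z|x) || p(z) ) ] = R^AVG

This holds because the averaged KL to the prior decomposes as I(X, Z) + D_KL( q(z) || p(z) ), where q(z) is the aggregate posterior, and the second term is non-negative. So whenever training raises I(X, Z), the averaged KL term has to rise with it.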
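Finally, a minimal toy of the min-max dynamics (my own illustrative sketch, not the paper's code: the 1-D Encoder/Decoder, CAPACITY, BETA_LR and the projected dual-ascent update on \beta are all assumptions):

import torch
import torch.nn as nn

CAPACITY = 0.5   # target KL capacity C in nats (arbitrary choice)
BETA_LR = 0.01   # step size for the dual ascent on beta

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 2)          # outputs (mu, log_var) of q(z|x)
    def forward(self, x):
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return mu, log_var

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 1)          # reconstructs x from a sampled z
    def forward(self, z):
        return self.net(z)

enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
beta = 0.0                                   # Lagrange multiplier, kept >= 0

for step in range(5001):
    x = torch.randn(64, 1)                   # toy data (the paper uses spectrograms)
    mu, log_var = enc(x)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterised sample
    rec_loss = (dec(z) - x).abs().mean()                     # l1 decoder loss
    # D_KL( q(z|x) || N(0, 1) ), averaged over the batch
    d_kl = 0.5 * (mu ** 2 + log_var.exp() - 1.0 - log_var).sum(-1).mean()

    # theta-step: minimise the whole objective with beta treated as a constant
    loss = rec_loss + beta * (d_kl - CAPACITY)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # beta-step (projected dual ascent): beta rises while D_KL > C and falls
    # while D_KL < C, so D_KL ends up oscillating around the capacity C
    beta = max(0.0, beta + BETA_LR * (d_kl.item() - CAPACITY))

    if step % 500 == 0:
        print(f"step {step:5d}  D_KL = {d_kl.item():.3f}  beta = {beta:.3f}")

If this behaves as described above, the printout should show beta climbing while D_KL > CAPACITY, and D_KL then hovering around CAPACITY rather than collapsing to zero.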