Synthbot

Controllable Speech Synthesis summary

Oct 22nd, 2019
Semi-Supervised Generative Modeling for Controllable Speech Synthesis

This model is trained with partial supervision.
* This is a sequence-to-sequence model with an encoder and decoder.
* Encoder network:
* The input to the encoder is a sequence of phonemes.
* Each phoneme is represented by a 256-dimensional embedding. They do not specify how they created these phoneme embeddings.
* The phoneme input is passed through a pre-net of two fully connected ReLU layers (256 units, 128 units). A dropout rate of 0.5 is applied to each layer. (This pre-net is sketched below, after the list.)
* The dense output is passed through a CBHG text encoder to create a sequence of text embeddings.
* The text embeddings are augmented with a learned speaker embedding (not described) and the variational posteriors (described below; an augmentation sketch appears after the list).
* CBHG text encoder:
* Decoder network:
* The input to the Attention RNN is (1) a context vector, (2) recursive input from the same layer, and (3) recursive input from the decoder output.
* (1) An attention module with 128 hidden tanh units, a 5-component GMM attention window, and softmax weighting over encoder outputs. (See the attention sketch after the list.)
* (2) An LSTM with 256 units and 0.1 zoneout.
* (3) A pre-network that’s a copy of the encoder network with a separate set of parameters. The input to this pre-network is described later.
* The output of the Attention RNN is passed through the Decoder RNN.
* The Decoder RNN consists of two LSTM layers, each with 256 units and 0.1 zoneout, with residual connections between the layers. (See the decoder sketch after the list.)
* The output of the Decoder RNN is passed through a linear layer. Each output of this linear layer produces 2 mel spectrogram frames.
* The second of these frames is passed to the next decoder step.
* During training, the ground truth frame is passed to the decoder instead of the output frame.
* Before being passed to the decoder, this frame is passed through the pre-network of (3) above.
* The attention context is recalculated on each decoder step.
* Variational posteriors network:
* The input to the variational posteriors network is a mel spectrogram. (This network is sketched after the list.)
* The mel spectrogram is passed through 6 convolutional layers. Each has a ReLU nonlinearity, 3x3 filters, 2x2 stride, and batch normalization. They have 32, 32, 64, 64, 128, and 128 filters respectively.
* The output of the convolutions is passed through a unidirectional LSTM with 128 units.
* The output of the LSTM is augmented with the final encoder LSTM output.
* The combined vector is passed through an MLP with 128 tanh hidden units.
* The output of the MLP consists of the parameters of a diagonal Gaussian distribution. The output of the Gaussian distribution corresponds to the supervised training labels.
* For supervised labels, the output is a categorical distribution with just a mean.
* A single random sample is taken from the Gaussian distribution and concatenated with the output of the encoder network.
* The mel spectrogram frames are converted to waveforms using WaveRNN.
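
The sketches below are minimal PyTorch reconstructions of the components summarised above; they are not from the paper. Layer sizes come from the notes, while module names, tensor layouts, and any value not stated above are assumptions.

First, the encoder pre-net: two fully connected ReLU layers (256 and 128 units) with 0.5 dropout on each.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Two fully connected ReLU layers (256 -> 128) with dropout 0.5 on each."""
    def __init__(self, in_dim=256, sizes=(256, 128), dropout=0.5):
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.ReLU(), nn.Dropout(dropout)]
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, time, in_dim) phoneme embeddings
        return self.net(x)

# Example: a batch of 2 sequences of 50 phoneme embeddings (256-dim).
phonemes = torch.randn(2, 50, 256)
out = PreNet()(phonemes)  # (2, 50, 128), fed to the CBHG text encoder
```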
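
Next, the attention module: a 5-component GMM attention window computed from an MLP with 128 tanh hidden units. The notes only give the component count, hidden size, and softmax weighting; the Graves-style monotonically advancing means and the softplus parameterisation of the increments and widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    """Sketch of a 5-component GMM attention window over encoder outputs."""
    def __init__(self, query_dim=256, hidden=128, K=5):
        super().__init__()
        self.K = K
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.Tanh(), nn.Linear(hidden, 3 * K)
        )

    def forward(self, query, memory, prev_mu):
        # query:   (B, query_dim)  attention-RNN hidden state
        # memory:  (B, T, D)       encoder outputs
        # prev_mu: (B, K)          component means from the previous step
        B, T, _ = memory.shape
        w_hat, delta_hat, sigma_hat = self.mlp(query).chunk(3, dim=-1)
        w = F.softmax(w_hat, dim=-1)                 # mixture weights
        mu = prev_mu + F.softplus(delta_hat)         # monotonically advancing means
        sigma = F.softplus(sigma_hat) + 1e-5         # component widths
        pos = torch.arange(T, device=memory.device, dtype=memory.dtype)
        # Unnormalised Gaussian scores per component and encoder position: (B, K, T)
        scores = torch.exp(-0.5 * ((pos[None, None, :] - mu[:, :, None]) / sigma[:, :, None]) ** 2)
        align = (w[:, :, None] * scores).sum(dim=1)  # (B, T) alignment weights
        align = align / align.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1)  # (B, D) context vector
        return context, align, mu
```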
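
One step of the Decoder RNN: two 256-unit LSTM cells with 0.1 zoneout and a residual connection between the layers, followed by a linear projection that emits 2 mel frames per step. The number of mel bins (80) and the exact zoneout details are assumptions.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Sketch of one decoder step: two zoneout LSTM cells with a residual
    connection, then a linear layer producing 2 mel frames."""
    def __init__(self, in_dim=512, units=256, n_mels=80, frames_per_step=2, zoneout=0.1):
        super().__init__()
        self.cell1 = nn.LSTMCell(in_dim, units)
        self.cell2 = nn.LSTMCell(units, units)
        self.proj = nn.Linear(units, n_mels * frames_per_step)
        self.zoneout = zoneout
        self.n_mels, self.frames_per_step = n_mels, frames_per_step

    def _zoneout(self, new, old):
        # Zoneout: randomly keep the previous hidden state during training,
        # use the expected value at inference time.
        if self.training:
            keep = (torch.rand_like(new) < self.zoneout).float()
            return keep * old + (1.0 - keep) * new
        return self.zoneout * old + (1.0 - self.zoneout) * new

    def forward(self, x, state1, state2):
        # x: (B, in_dim) output of the Attention RNN for one decoder step
        h1, c1 = self.cell1(x, state1)
        h1 = self._zoneout(h1, state1[0])
        h2, c2 = self.cell2(h1, state2)
        h2 = self._zoneout(h2, state2[0])
        out = h2 + h1                                 # residual connection between layers
        frames = self.proj(out).view(-1, self.frames_per_step, self.n_mels)
        return frames, (h1, c1), (h2, c2)

# Example step for a batch of 4 (input size 512 is an assumption):
B, units = 4, 256
x = torch.randn(B, 512)
state1 = (torch.zeros(B, units), torch.zeros(B, units))
state2 = (torch.zeros(B, units), torch.zeros(B, units))
frames, state1, state2 = DecoderRNN()(x, state1, state2)  # frames: (4, 2, 80)
```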
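
The variational posterior network: six strided 3x3 convolutions (32, 32, 64, 64, 128, 128 filters, each with batch normalization and ReLU) over the mel spectrogram, a 128-unit unidirectional LSTM, concatenation with the final encoder LSTM output, and a 128-unit tanh MLP that outputs the mean and log-variance of a diagonal Gaussian, from which a single reparameterised sample is drawn. The latent dimension and the log-variance parameterisation are assumptions; the categorical case for supervised labels is omitted.

```python
import torch
import torch.nn as nn

class VariationalPosterior(nn.Module):
    """Sketch of the posterior network: conv stack -> LSTM -> tanh MLP -> diagonal Gaussian."""
    def __init__(self, n_mels=80, enc_dim=256, latent_dim=16):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        convs = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                      nn.BatchNorm2d(cout),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        # Frequency bins remaining after six stride-2 convolutions.
        f = n_mels
        for _ in range(6):
            f = (f - 1) // 2 + 1
        self.lstm = nn.LSTM(128 * f, 128, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(128 + enc_dim, 128), nn.Tanh(),
                                 nn.Linear(128, 2 * latent_dim))

    def forward(self, mel, enc_final):
        # mel:       (B, T, n_mels) reference mel spectrogram
        # enc_final: (B, enc_dim)   final text-encoder LSTM output
        x = self.convs(mel.unsqueeze(1))                       # (B, 128, T', F')
        B, C, T, Fr = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * Fr)
        _, (h, _) = self.lstm(x)                               # h: (1, B, 128)
        feat = torch.cat([h[-1], enc_final], dim=-1)
        mu, logvar = self.mlp(feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # one reparameterised sample
        return z, mu, logvar
```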
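
Finally, the augmentation of the text embeddings with the speaker embedding and the posterior sample. The notes do not say how "augmented" is implemented; broadcasting both vectors across time and concatenating them to each text embedding is one common reading, assumed here.

```python
import torch

def augment_text_embeddings(text_emb, speaker_emb, z):
    # text_emb:    (B, T, text_dim)  CBHG text-encoder outputs
    # speaker_emb: (B, speaker_dim)  learned speaker embedding
    # z:           (B, latent_dim)   sample from the variational posterior
    extras = torch.cat([speaker_emb, z], dim=-1)
    extras = extras.unsqueeze(1).expand(-1, text_emb.size(1), -1)
    return torch.cat([text_emb, extras], dim=-1)  # (B, T, text_dim + speaker_dim + latent_dim)
```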