Synthbot

Controllable Speech Synthesis summary

Oct 22nd, 2019
Semi-Supervised Generative Modeling for Controllable Speech Synthesis

This model is trained with partial supervision.
* This is a sequence-to-sequence model with an encoder and decoder.
* Encoder network:
* The input to the encoder is a sequence of phonemes.
* Each phoneme is represented by a 256-dimensional embedding. They do not specify how they created these phoneme embeddings.
* The phoneme input is passed through a pre-net of two fully connected ReLU layers (256 units, 128 units). A dropout rate of 0.5 is applied to each layer. (This pre-net is sketched below, after the list.)
* The dense output is passed through a CBHG text encoder to create a sequence of text embeddings.
* The text embeddings are augmented with a learned speaker embedding (not described) and the variational posteriors (described below; an augmentation sketch appears after the list).
* CBHG text encoder:
* Decoder network:
* The input to the Attention RNN is (1) a context vector, (2) recursive input from the same layer, and (3) recursive input from the decoder output.
* (1) An attention module with 128 hidden tanh units, a 5-component GMM attention window, and softmax weighting over encoder outputs. (See the attention sketch after the list.)
* (2) An LSTM with 256 units and 0.1 zoneout.
* (3) A pre-network that’s a copy of the encoder network with a separate set of parameters. The input to this pre-network is described later.
* The output of the Attention RNN is passed through the Decoder RNN.
* The Decoder RNN consists of two LSTM layers, each with 256 units and 0.1 zoneout, with residual connections between the layers. (See the decoder sketch after the list.)
* The output of the Decoder RNN is passed through a linear layer. Each output of this linear layer produces 2 mel spectrogram frames.
* The second of these frames is passed to the next decoder step.
* During training, the ground truth frame is passed to the decoder instead of the output frame.
* Before being passed to the decoder, this frame is passed through the pre-network of (3) above.
* The attention context is recalculated on each decoder step.
* Variational posteriors network:
* The input to the variational posteriors network is a mel spectrogram. (This network is sketched after the list.)
* The mel spectrogram is passed through 6 convolutional layers. Each has a ReLU nonlinearity, 3x3 filters, 2x2 stride, and batch normalization. They have 32, 32, 64, 64, 128, and 128 filters respectively.
* The output of the convolutions is passed through a unidirectional LSTM with 128 units.
* The output of the LSTM is augmented with the final encoder LSTM output.
* The combined vector is passed through an MLP with 128 tanh hidden units.
* The output of the MLP consists of the parameters of a diagonal Gaussian distribution. The output of the Gaussian distribution corresponds to the supervised training labels.
* For supervised labels, the output is a categorical distribution with just a mean.
* A single random sample is taken from the Gaussian distribution and concatenated with the output of the encoder network.
* The mel spectrogram frames are converted to waveforms using WaveRNN.
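
The sketches below are minimal PyTorch reconstructions of the components summarised above; they are not from the paper. Layer sizes come from the notes, while module names, tensor layouts, and any value not stated above are assumptions.

First, the encoder pre-net: two fully connected ReLU layers (256 and 128 units) with 0.5 dropout on each.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Two fully connected ReLU layers (256 -> 128) with dropout 0.5 on each."""
    def __init__(self, in_dim=256, sizes=(256, 128), dropout=0.5):
        super().__init__()
        layers, prev = [], in_dim
        for size in sizes:
            layers += [nn.Linear(prev, size), nn.ReLU(), nn.Dropout(dropout)]
            prev = size
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, time, in_dim) phoneme embeddings
        return self.net(x)

# Example: a batch of 2 sequences of 50 phoneme embeddings (256-dim).
phonemes = torch.randn(2, 50, 256)
out = PreNet()(phonemes)  # (2, 50, 128), fed to the CBHG text encoder
```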
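
Next, the attention module: a 5-component GMM attention window computed from an MLP with 128 tanh hidden units. The notes only give the component count, hidden size, and softmax weighting; the Graves-style monotonically advancing means and the softplus parameterisation of the increments and widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    """Sketch of a 5-component GMM attention window over encoder outputs."""
    def __init__(self, query_dim=256, hidden=128, K=5):
        super().__init__()
        self.K = K
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.Tanh(), nn.Linear(hidden, 3 * K)
        )

    def forward(self, query, memory, prev_mu):
        # query:   (B, query_dim)  attention-RNN hidden state
        # memory:  (B, T, D)       encoder outputs
        # prev_mu: (B, K)          component means from the previous step
        B, T, _ = memory.shape
        w_hat, delta_hat, sigma_hat = self.mlp(query).chunk(3, dim=-1)
        w = F.softmax(w_hat, dim=-1)                 # mixture weights
        mu = prev_mu + F.softplus(delta_hat)         # monotonically advancing means
        sigma = F.softplus(sigma_hat) + 1e-5         # component widths
        pos = torch.arange(T, device=memory.device, dtype=memory.dtype)
        # Unnormalised Gaussian scores per component and encoder position: (B, K, T)
        scores = torch.exp(-0.5 * ((pos[None, None, :] - mu[:, :, None]) / sigma[:, :, None]) ** 2)
        align = (w[:, :, None] * scores).sum(dim=1)  # (B, T) alignment weights
        align = align / align.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        context = torch.bmm(align.unsqueeze(1), memory).squeeze(1)  # (B, D) context vector
        return context, align, mu
```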
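
One step of the Decoder RNN: two 256-unit LSTM cells with 0.1 zoneout and a residual connection between the layers, followed by a linear projection that emits 2 mel frames per step. The number of mel bins (80) and the exact zoneout details are assumptions.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Sketch of one decoder step: two zoneout LSTM cells with a residual
    connection, then a linear layer producing 2 mel frames."""
    def __init__(self, in_dim=512, units=256, n_mels=80, frames_per_step=2, zoneout=0.1):
        super().__init__()
        self.cell1 = nn.LSTMCell(in_dim, units)
        self.cell2 = nn.LSTMCell(units, units)
        self.proj = nn.Linear(units, n_mels * frames_per_step)
        self.zoneout = zoneout
        self.n_mels, self.frames_per_step = n_mels, frames_per_step

    def _zoneout(self, new, old):
        # Zoneout: randomly keep the previous hidden state during training,
        # use the expected value at inference time.
        if self.training:
            keep = (torch.rand_like(new) < self.zoneout).float()
            return keep * old + (1.0 - keep) * new
        return self.zoneout * old + (1.0 - self.zoneout) * new

    def forward(self, x, state1, state2):
        # x: (B, in_dim) output of the Attention RNN for one decoder step
        h1, c1 = self.cell1(x, state1)
        h1 = self._zoneout(h1, state1[0])
        h2, c2 = self.cell2(h1, state2)
        h2 = self._zoneout(h2, state2[0])
        out = h2 + h1                                 # residual connection between layers
        frames = self.proj(out).view(-1, self.frames_per_step, self.n_mels)
        return frames, (h1, c1), (h2, c2)

# Example step for a batch of 4 (input size 512 is an assumption):
B, units = 4, 256
x = torch.randn(B, 512)
state1 = (torch.zeros(B, units), torch.zeros(B, units))
state2 = (torch.zeros(B, units), torch.zeros(B, units))
frames, state1, state2 = DecoderRNN()(x, state1, state2)  # frames: (4, 2, 80)
```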
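
The variational posterior network: six strided 3x3 convolutions (32, 32, 64, 64, 128, 128 filters, each with batch normalization and ReLU) over the mel spectrogram, a 128-unit unidirectional LSTM, concatenation with the final encoder LSTM output, and a 128-unit tanh MLP that outputs the mean and log-variance of a diagonal Gaussian, from which a single reparameterised sample is drawn. The latent dimension and the log-variance parameterisation are assumptions; the categorical case for supervised labels is omitted.

```python
import torch
import torch.nn as nn

class VariationalPosterior(nn.Module):
    """Sketch of the posterior network: conv stack -> LSTM -> tanh MLP -> diagonal Gaussian."""
    def __init__(self, n_mels=80, enc_dim=256, latent_dim=16):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        convs = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            convs += [nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                      nn.BatchNorm2d(cout),
                      nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        # Frequency bins remaining after six stride-2 convolutions.
        f = n_mels
        for _ in range(6):
            f = (f - 1) // 2 + 1
        self.lstm = nn.LSTM(128 * f, 128, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(128 + enc_dim, 128), nn.Tanh(),
                                 nn.Linear(128, 2 * latent_dim))

    def forward(self, mel, enc_final):
        # mel:       (B, T, n_mels) reference mel spectrogram
        # enc_final: (B, enc_dim)   final text-encoder LSTM output
        x = self.convs(mel.unsqueeze(1))                       # (B, 128, T', F')
        B, C, T, Fr = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * Fr)
        _, (h, _) = self.lstm(x)                               # h: (1, B, 128)
        feat = torch.cat([h[-1], enc_final], dim=-1)
        mu, logvar = self.mlp(feat).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # one reparameterised sample
        return z, mu, logvar
```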
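
Finally, the augmentation of the text embeddings with the speaker embedding and the posterior sample. The notes do not say how "augmented" is implemented; broadcasting both vectors across time and concatenating them to each text embedding is one common reading, assumed here.

```python
import torch

def augment_text_embeddings(text_emb, speaker_emb, z):
    # text_emb:    (B, T, text_dim)  CBHG text-encoder outputs
    # speaker_emb: (B, speaker_dim)  learned speaker embedding
    # z:           (B, latent_dim)   sample from the variational posterior
    extras = torch.cat([speaker_emb, z], dim=-1)
    extras = extras.unsqueeze(1).expand(-1, text_emb.size(1), -1)
    return torch.cat([text_emb, extras], dim=-1)  # (B, T, text_dim + speaker_dim + latent_dim)
```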