- Semi-Supervised Generative Modeling for Controllable Speech Synthesis
- This model is trained with partial supervision.
- * This is a sequence-to-sequence model with an encoder and decoder.
- * Encoder network:
- * The input to the encoder is a sequence of phonemes.
- * Each phoneme is represented by a 256-dimensional embedding. The paper does not specify how these phoneme embeddings are created.
- * The phoneme input is passed through a pre-net of two fully connected ReLU layers (256 units, 128 units). A dropout rate of 0.5 is applied to each layer.
- * The dense output is passed through a CBHG text encoder to create a sequence of text embeddings.
- * The text embeddings are augmented with a learned speaker embedding (not described) and a sample from the variational posterior (described below).
- * CBHG text encoder: not detailed in these notes; in Tacotron, CBHG is a 1-D convolution bank followed by a highway network and a bidirectional GRU. A minimal sketch of the encoder front end follows below.
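A minimal sketch of the encoder front end (phoneme embedding plus pre-net), written here in PyTorch purely for illustration; the notes do not name a framework. The 256-d embeddings, the 256/128 ReLU layers, and the 0.5 dropout come from the notes above, while the vocabulary size n_phonemes is a hypothetical placeholder and the CBHG module is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderPreNet(nn.Module):
    """Phoneme embedding + two fully connected ReLU layers (256, 128 units),
    dropout 0.5 on each, as in the notes. Vocabulary size is an assumption."""

    def __init__(self, n_phonemes=100, emb_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)  # 256-d phoneme embeddings
        self.fc1 = nn.Linear(emb_dim, 256)
        self.fc2 = nn.Linear(256, 128)

    def forward(self, phoneme_ids):
        x = self.embed(phoneme_ids)                     # (batch, time, 256)
        x = F.dropout(F.relu(self.fc1(x)), p=0.5, training=self.training)
        x = F.dropout(F.relu(self.fc2(x)), p=0.5, training=self.training)
        return x                                        # (batch, time, 128), fed to CBHG
```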
- * Decoder network:
- * The input to the Attention RNN is (1) an attention context vector, (2) recurrent input from the same layer, and (3) recurrent input from the previous decoder output.
- * (1) The context vector is produced by an attention module with 128 hidden tanh units, a 5-component GMM attention window, and softmax weighting over the encoder outputs (a sketch follows this list).
- * (2) The Attention RNN itself is an LSTM with 256 units and 0.1 zoneout, providing its own recurrent input.
- * (3) The previous decoder output is passed through a pre-net that is a copy of the encoder pre-net with a separate set of parameters. The input to this pre-net is described later.
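A sketch of the GMM attention computation, under the same PyTorch-for-illustration assumption. Only the 128 tanh hidden units, the 5 mixture components, and the softmax weighting are from the notes; reading "softmax weighting" as softmax-normalized mixture weights, and the exp parameterization of window widths and monotonic position increments, follow the common Graves-style GMM attention formulation and are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    """Graves-style GMM attention: 128 tanh hidden units, K=5 components.
    Parameterization details are assumptions, not given in the notes."""

    def __init__(self, query_dim=256, hidden=128, K=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 3 * K),  # per component: weight, width, position delta
        )
        self.K = K

    def forward(self, query, memory, prev_mu):
        # query:   (batch, query_dim) -- Attention RNN state
        # memory:  (batch, T, dim)    -- encoder outputs
        # prev_mu: (batch, K)         -- previous window positions (zeros at step 0)
        w_hat, sigma_hat, delta_hat = self.mlp(query).chunk(3, dim=-1)
        w = F.softmax(w_hat, dim=-1)             # softmax-normalized mixture weights
        sigma = torch.exp(sigma_hat)             # window widths, kept positive
        mu = prev_mu + torch.exp(delta_hat)      # monotonic position update
        t = torch.arange(memory.size(1), device=memory.device).float()
        # Gaussian bump per component, summed into one alignment over time.
        phi = (w.unsqueeze(-1) *
               torch.exp(-((t - mu.unsqueeze(-1)) ** 2)
                         / (2 * sigma.unsqueeze(-1) ** 2))).sum(dim=1)  # (batch, T)
        context = torch.bmm(phi.unsqueeze(1), memory).squeeze(1)
        return context, mu
```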
- * The output of the Attention RNN is passed through the Decoder RNN.
- * The Decoder RNN consists of two LSTM layers, each with 256 units and 0.1 zoneout, and residual connections between layers.
- * The output of the Decoder RNN is passed through a linear layer. Each output of this linear layer produces 2 mel spectrogram frames.
- * The second of these frames is passed to the next decoder step.
- * During training, the ground-truth frame is passed to the decoder instead of the predicted frame (teacher forcing).
- * Before being passed to the decoder, this frame is passed through the pre-net of (3) above.
- * The attention context is recalculated at each decoder step. A minimal sketch of one decoder step follows below.
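One decoder step as a sketch, combining the Attention RNN, the two-layer residual Decoder RNN, and the 2-frames-per-step projection from the notes. n_mels = 80 and the context width ctx_dim are assumptions, zoneout (0.1) is omitted for brevity, and the LSTMCell decomposition is one plausible reading of the recurrences described above:

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive decoder step. LSTM sizes (256) and the two-frame
    output follow the notes; zoneout is left out of this sketch."""

    def __init__(self, prenet_dim=128, ctx_dim=256, n_mels=80):
        super().__init__()
        self.attn_rnn = nn.LSTMCell(prenet_dim + ctx_dim, 256)
        self.dec_rnn1 = nn.LSTMCell(256 + ctx_dim, 256)
        self.dec_rnn2 = nn.LSTMCell(256, 256)
        self.proj = nn.Linear(256, 2 * n_mels)   # 2 mel spectrogram frames per step

    def forward(self, prenet_out, context, state):
        (ha, ca), (h1, c1), (h2, c2) = state
        # Attention RNN consumes the pre-net output plus the attention context.
        ha, ca = self.attn_rnn(torch.cat([prenet_out, context], -1), (ha, ca))
        # Decoder RNN: two LSTM layers with a residual connection between them.
        h1, c1 = self.dec_rnn1(torch.cat([ha, context], -1), (h1, c1))
        h2, c2 = self.dec_rnn2(h1, (h2, c2))
        frames = self.proj(h1 + h2)              # residual sum -> linear projection
        return frames, ((ha, ca), (h1, c1), (h2, c2))
```

During training, the ground-truth second frame of the current step is passed through the decoder pre-net of (3) above to become prenet_out for the next step (teacher forcing); at inference, the model's own second output frame is used instead, and context is recomputed from the new Attention RNN state at every step.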
- * Variational posteriors network:
- * The input to the variational posteriors network is a mel spectrogram.
- * The mel spectrogram is passed through 6 convolutional layers. Each has a ReLU nonlinearity, 3x3 filters, 2x2 stride, and batch normalization. They have 32, 32, 64, 64, 128, and 128 filters respectively.
- * The output of the convolutions is passed through a unidirectional LSTM with 128 units.
- * The output of the LSTM is augmented with the final encoder LSTM output.
- * The combined vector is passed through an MLP with 128 tanh hidden units.
- * The output of the MLP consists of the parameters of a diagonal Gaussian distribution. The latent variable this Gaussian models corresponds to the supervised training labels.
- * For categorical supervised labels, the output is a categorical distribution, parameterized by just a mean.
- * A single random sample is drawn from the Gaussian distribution and concatenated with the output of the encoder network (see the sketch below).
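A sketch of the full posterior network in the same illustrative PyTorch. The six conv layers (32, 32, 64, 64, 128, 128 filters, 3x3 filters, 2x2 stride, batch norm, ReLU), the 128-unit unidirectional LSTM, and the 128-tanh MLP follow the notes; n_mels, latent_dim, the width of the text-side summary, and the reparameterized sampling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosteriorNet(nn.Module):
    """Mel spectrogram -> conv stack -> LSTM -> (with text summary) -> MLP ->
    diagonal Gaussian parameters, sampled once via reparameterization."""

    def __init__(self, n_mels=80, text_dim=256, latent_dim=16):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.ModuleList(
            nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1)
            for cin, cout in zip(chans[:-1], chans[1:]))
        self.bns = nn.ModuleList(nn.BatchNorm2d(c) for c in chans[1:])
        freq_out = n_mels
        for _ in range(6):
            freq_out = (freq_out + 1) // 2        # stride-2 halving with padding 1
        self.lstm = nn.LSTM(128 * freq_out, 128, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(128 + text_dim, 128), nn.Tanh(),
            nn.Linear(128, 2 * latent_dim))       # mean and log-variance

    def forward(self, mel, text_summary):
        # mel: (batch, time, n_mels) -> (batch, 1, time, n_mels)
        x = mel.unsqueeze(1)
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x)))
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)
        _, (h, _) = self.lstm(x)                  # final LSTM state summarizes the mel
        h = torch.cat([h[-1], text_summary], -1)  # augment with encoder summary
        mean, logvar = self.mlp(h).chunk(2, -1)
        # Reparameterized sample, later concatenated with the encoder outputs.
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
```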
- * The mel spectrogram frames are converted to waveforms using WaveRNN.