CookieAnon

TT2 Notebook Changelog/Personal Log

Dec 28th, 2019
V1.0
- It exists
- Audio source trimmed & converted to 22050Hz (thanks synthbot)
- Added alignment graph and progress bars
V1.0.1
- Added toggle for filtering out noisy clips (Rainbow likes to flap her wings a lot)
V1.0.2
- Ability to filter in emotions
- More obvious/clear setting for picking the name of the saved model.
- Config for graph resolution in Colab.
V1.1.0 (Memory leak, otherwise working)
- 48kHz pretrained model (MMI).
- CTC/MMI objectives from https://github.com/bfs18/tacotron2 for more robust alignments (while ensuring the model stays compatible with the original synthesis code).
- Show training loss with TQDM.
V1.2.0 (Dead)

V1.3.0 (Working on it)
- Package pre-processed spectrograms (around 18GB); maybe that's enough to finally get multi-speaker training set up on Colab before the time limit?
- Need to replace the naive ARPAbet transcripts with a force-aligned version, or just make a pre-processing Colab notebook that uses your Google Drive to store the spectrograms. (Note: 80-channel specs for 22.05kHz, so the size is actually closer to 9GB.)
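For reference, a rough way to sanity-check that 9GB figure (a sketch; float32 mels, hop length 256 and roughly 90 hours of audio are assumptions, not numbers taken from the actual filelists):

    # Rough mel-spectrogram storage estimate (illustrative numbers only).
    def mel_storage_gb(hours, n_mel=80, sample_rate=22050, hop_length=256, bytes_per_val=4):
        frames_per_sec = sample_rate / hop_length          # ~86 frames per second of audio
        total_frames = hours * 3600 * frames_per_sec
        return total_frames * n_mel * bytes_per_val / 1e9  # GB of raw float32 mels

    print(round(mel_storage_gb(90), 1))  # ~90 hours comes out to roughly 9GB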
- Clip and process Nancy original studio files into standard format with quotes. (Done)
- Add Nancy corpus to Tacotron2 speaker_embeddings (Done)
- Train (Done, needs a little fine-tuning later)
- Generate GTA mels for WaveGlow. (Done, needs a little fine-tuning later)
- Train WaveGlow (512 residual channels, 9 layers) on Nancy + VCTK, compare audio quality.
- Update 03/15/20: LL is not a good indicator of quality; volume normalisation and dataset size have much larger impacts.
- Update 03/15/20: It continues to improve; I'm just going to let it train until I can't hear any further improvement by ear.
- Update 03/20/20: Testing pre-emphasis, ReZero, and flow-count/conv channel-count with mini-models.
- Update 03/26/20: I can't figure out why quality is still bad. There's not much left that it could be now.
I'll be trying a few last things:
1. Training a very large model, for a long time (Done)
2. Only training on the Nancy dataset (Done)
3. Testing n_group independently (Done)
Smaller n_group takes longer to train but performs significantly better.
Very large values cause a lot of low-frequency noise. Values under 8 do not train in a reasonable time.
https://github.com/NVIDIA/DeepLearningExamples/issues/319#issuecomment-606305888
- Update 04/16/20: Set up a util to reshape old cond_layer weights, meaning I can add speaker_embeddings to any of my previous models without resetting/losing any performance (rough sketch below). Now just the waiting game!
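Not the actual util, but a minimal sketch of the idea: widen a trained Conv1d so it also accepts speaker-embedding channels, zero-initialising the new input weights so the old model's outputs are unchanged until fine-tuning resumes (the layer sizes below are placeholders, not the real checkpoint shapes):

    import torch
    import torch.nn as nn

    def widen_conv1d_inputs(old_conv: nn.Conv1d, extra_in: int) -> nn.Conv1d:
        """Return a Conv1d with extra input channels; the new weights start at zero,
        so the widened layer initially behaves exactly like the old one."""
        new_conv = nn.Conv1d(old_conv.in_channels + extra_in, old_conv.out_channels,
                             old_conv.kernel_size[0], padding=old_conv.padding[0],
                             bias=old_conv.bias is not None)
        with torch.no_grad():
            new_conv.weight.zero_()
            new_conv.weight[:, :old_conv.in_channels] = old_conv.weight  # keep trained weights
            if old_conv.bias is not None:
                new_conv.bias.copy_(old_conv.bias)
        return new_conv

    # e.g. mel-only cond_layer (80 ch) -> mel + 32-dim speaker embedding
    cond_layer = widen_conv1d_inputs(nn.Conv1d(80, 512, 1), extra_in=32)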
1. Added speaker_embeddings to the 48 Flow model, minor training required (Done; this model still lacks high-frequency detail, though it's better than before and Applejack found large improvements)
2. Also modifying WaveGlow into WaveFlow; though I suspect there's already a PyTorch implementation, it seems like fun. (In progress)
https://github.com/PaddlePaddle/Parakeet/blob/develop/parakeet/models/waveflow/waveflow_modules.py
~~WaveFlow is working, starting hparam optimization and testing.~~
WaveFlow isn't fully working. The model can learn when to be loud/quiet, but the output is otherwise all white noise.
3. I'd like to find a better way to input speaker-specific information into the models. Concatenating speaker embeddings to the cond_layer inputs improves the quality of converged models, but higher frequencies still appear washed out.
I also noticed a difference between the Yoyololicon and Nvidia implementations, one that could be very useful.
Nvidia squeezes the mel spectrograms into [B, n_mel, n_group, T//n_group], where each timestep being convolved and targeted has its entire set of mel-spectrogram frames available. Yoyololicon uses [B, n_mel, T//n_group], where a single mel-spec frame is used to condition the entire n_group. (A rough shape comparison is sketched after this item.)
The difference in audio quality is not measurable, and there is also a post from someone who used nearest-neighbour upsampling with WaveGlow and found the quality still satisfactory. I therefore propose upsampling the spectrogram in height using speaker_embeddings, to give WN more information and mix the speaker information in better while keeping performance (Nvidia inputs n_group redundant spectrograms, so I'll drop those and see if I can fit this new upsampling method in without losing performance).
I can further use an L1/L2 loss between the input spectrogram and the upsampled spectrogram for faster training, or even train the upsampling layer separately from the rest of the model, allowing the 48kHz and 22kHz Colab models to interact with the new WaveGlow model without having to retrain either the Colab models or the WaveGlow.
The MelNet upsampling looks very high quality, so I've stolen that and will see if it works (I'll need a place to input speaker embeddings that's better than the cond_layer for this to be worthwhile). I could also upsample along the time axis, though WaveGlow already seems to have enough time detail in the spectrograms, if the outputs are any indication.
Update 04/20/20: MelNet-style upsampling requires more VRAM than I'm comfortable using (2-8GB). I'll test 320-channel and 80-channel spectrograms later and make sure that spectrogram detail actually improves the higher frequencies versus, say, increasing n_channels further or using a fully causal autoregressive system.
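For reference, a rough sketch of the two conditioning layouts mentioned above (shapes only, assuming the mel is already upsampled to audio rate; this is not the actual code from either repo):

    import torch

    B, n_mel, n_group, T = 1, 80, 8, 4000
    mel = torch.randn(B, n_mel, T)                    # conditioning at audio rate
    audio = torch.randn(B, T)

    # Nvidia-style: every squeezed timestep sees all n_group of its mel frames.
    mel_nv = mel.unfold(2, n_group, n_group)          # [B, n_mel, T//n_group, n_group]
    mel_nv = mel_nv.permute(0, 1, 3, 2)               # [B, n_mel, n_group, T//n_group]
    mel_nv = mel_nv.reshape(B, n_mel * n_group, -1)   # what WN actually consumes

    # Yoyololicon-style: one mel frame conditions the whole n_group of samples.
    mel_yoyo = mel[:, :, ::n_group]                   # [B, n_mel, T//n_group]

    audio_sq = audio.unfold(1, n_group, n_group).permute(0, 2, 1)  # [B, n_group, T//n_group]
    print(mel_nv.shape, mel_yoyo.shape, audio_sq.shape)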
4. Update 04/21/20: In order to mix the speaker_embedding in better, I've added cond_hidden_dim and n_cond_layers params (rough sketch after this item). Increasing the convolutions performed on the spectrogram to 3 layers with a 256 hidden dim decreased training time by a factor of 3 (20k iters has comparable performance to 60k iters with a single cond_layer, and time-per-iter increases by about 6%). That's with the mini-model. I also notice that the mini-model still has trouble with higher frequencies.
I'll test training from scratch on one of the previous larger model configs and see if the decreased training time holds up with larger models as well (though in a couple of hours; this is the best-performing mini-model so far and I want to get to, say, 150K iters before killing it).
edit: Huzzah! I may be speaking too soon, but I think the mini-model has more-or-less converged after 130K iters at 2e-3 LR. Much faster than other models. Surprisingly fast compared to the past, but no complaints from me. I'll compare the spectrograms from 2e-3 LR converged vs 1e-3, etc., and see which qualities of the spectrogram/audio are controlled by LR.
edit2: My ReduceLROnPlateau was still running. I think my reproducibility just got thrown out the window.
edit3: The logs still have the info; LR dropped at 72K, and training time is still 3x faster than normal for the timeframe before the LR changed (and it looks like both models converged to a similar loss at 2e-3, so this modification mainly speeds up training time).
edit4: Started training Baseline Stacked Convs. Fingers crossed!
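Roughly what the stacked cond convs look like (a sketch; cond_hidden_dim and n_cond_layers as described above, but the activation and kernel size are guesses rather than the real hparams):

    import torch.nn as nn

    def build_cond_layers(n_mel=80, cond_out=512, cond_hidden_dim=256,
                          n_cond_layers=3, kernel_size=3):
        """Stack of Conv1d layers over the mel spectrogram, replacing the single cond_layer."""
        layers, in_ch = [], n_mel
        for i in range(n_cond_layers):
            out_ch = cond_out if i == n_cond_layers - 1 else cond_hidden_dim
            layers.append(nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2))
            if i != n_cond_layers - 1:
                layers.append(nn.LeakyReLU(0.1))
            in_ch = out_ch
        return nn.Sequential(*layers)

    cond_layer = build_cond_layers()   # [B, 80, T] -> [B, 512, T]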
5. Update 04/24/20: The Stacked Convs full model has developed artifacts near the end of inferred audio. This wouldn't be a problem if it weren't messing with my metrics! I'll do a quick rerun and see whether it's something to do with the run or the code, then potentially apply padding to the end of inference spectrograms that I can trim out later for evaluation.
6. Update 04/24/20: Retesting with the artifacts removed; so far performance slightly beats the previous best model. Also, note for later: check smaller window_lengths, since from what I can tell something as low as 1024 samples seems to be sufficient, and fewer samples would give the model more local information to work with (at the cost of frequency information, which seems less relevant with stacked convs).
7. Update 04/25/20: Best spect MSE I've ever seen, at 20K. Effectively beats the 40K stacked-conv mini-model, the 60K 48 Flow model and the 70K 512 n_channels model in terms of mel-spect MSE.
edit: Though I don't think it sounds as good as 60K 48 Flow or 70K 512 n_channels yet. I'll need to look for more human-relevant metrics.
edit2: Re-listening to the mini-model, there is a VERY clear improvement in quality when using stacked convs at 20K; at 120K the models are very similar (the stacked convs model does actually sound better by ear, but it's the removal of lower-frequency sound artifacts rather than improved higher frequencies).
8. Update 04/28/20: Tested a mini-model with f_mel = 1000, f_max = 7000. The model was able to learn harmonics (mostly) under 1000Hz despite not having that information explicitly given, suggesting the model is receiving sufficient information from the spectrogram and can infer details hidden by the low resolution. I've also started testing 1200-sample windows and got the gradients for global speaker embeddings sorted out. I'll also set up the reassigned spectrograms in a bit, which may provide further improvements.
9. 05/03/20: Massive rewrite of the data processing. Datasets have been highpassed, lowpassed, volume-normalized, denoised, more accurately trimmed, speaker IDs separated by audio source, and the filelists made intercompatible between the Tacotron2, WaveGlow and WaveFlow codebases. (The filtering/normalisation step is roughly sketched below.)
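The highpass/lowpass/volume-normalise step boils down to something like this (a sketch with made-up cutoffs; denoising and trimming aren't shown):

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def preprocess(audio: np.ndarray, sr: int, hp_hz=40.0, lp_hz=None, peak=0.95):
        """Highpass, optional lowpass, then peak-normalise a mono float waveform."""
        sos = butter(4, hp_hz, btype="highpass", fs=sr, output="sos")
        audio = sosfiltfilt(sos, audio)
        if lp_hz is not None:
            sos = butter(4, lp_hz, btype="lowpass", fs=sr, output="sos")
            audio = sosfiltfilt(sos, audio)
        return audio * (peak / max(np.abs(audio).max(), 1e-8))  # volume normalisation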
- Preserve decoder state between clips. (Need to review Mozilla TTS for examples)
Also links into:
- Truncated minibatches with resets (aka TBPTT) https://arxiv.org/pdf/1811.07240.pdf (DONE; rough sketch after this block)
Which should allow training on audio files that are multiple minutes long. (DONE)
- http://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/#Data-parallel_and_distributed-data-parallel
It's also extra easy to predict the pattern, so I should be able to make TBPTT work with the map-based dataset loader and the distributed loaders, no issue.
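A minimal sketch of the truncated-minibatches-with-resets loop (generic RNN-style code, not the actual Tacotron2 decoder interface; model.init_hidden() and the (x, y) segment format are assumptions):

    import torch.nn.functional as F

    def tbptt_train(model, optimizer, segments, hidden=None, reset=False):
        """One long clip split into consecutive segments; carry hidden state across
        segments, but detach it so gradients are truncated at segment boundaries."""
        if reset or hidden is None:
            hidden = model.init_hidden()          # reset only at clip boundaries
        for x, y in segments:
            optimizer.zero_grad()
            out, hidden = model(x, hidden)
            loss = F.mse_loss(out, y)
            loss.backward()
            optimizer.step()
            hidden = hidden.detach()              # truncate the graph, keep the state
        return hidden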
- Merging neighbouring mlp clips to create longer clips with more consistent emotion and more examples for the model of commas and periods in action. (Sounds very good in initial tests; need to try merging 3+ quotes and make a system for more than 2 speakers)
https://desuarchive.org/mlp/thread/35129047/#p35142297
- Train Tacotron2 on multiple speakers within the same clips.
Methods that might work:
- Speaker attention network - infinite number of speakers
- Stop_token but for speakers - allows more manual control, but prone to debouncing-style issues when using more than 2 speakers
- ?

- Update my torchMoji and TPCW-GST architectures to support TPSE-GST. https://arxiv.org/pdf/1808.01410.pdf Found better results when predicting Style Embeddings rather than Style Tokens. (Rough sketch below.)
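Roughly what the TPSE head looks like, per my reading of the paper (module sizes are guesses; the key points are predicting the style embedding from text-encoder outputs and stop-gradient on the target so the predictor doesn't drag the GST around):

    import torch
    import torch.nn as nn

    class TPSEPredictor(nn.Module):
        """Text-Predicted Style Embedding: text encoder outputs -> predicted style embedding."""
        def __init__(self, enc_dim=512, hidden=256, style_dim=256):
            super().__init__()
            self.gru = nn.GRU(enc_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, style_dim)

        def forward(self, text_enc):               # [B, T_text, enc_dim]
            _, h = self.gru(text_enc)              # summarise the text
            return torch.tanh(self.out(h[-1]))     # [B, style_dim]

    # training: tpse_loss = F.l1_loss(predictor(text_enc), gst_embedding.detach())
    # inference: use the predicted embedding in place of the reference-audio GST output.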
- Investigate and remove dataset biases for clip length
- Looks like GST can fix this problem, yay!
- G2P (Grapheme-to-Phoneme) or equivalent to provide better pronunciation during inference. (Tested; needs the merged.dict to be updated a bit)
- Batch inference (Done?)
- Fix 'memory leak' that happens when initialising model layers/weights on Google Colab
- Use exclusively relative paths
- Install Nvidia/Apex on Colab (this only relates to the training scripts)
- Drop Frame Rate (Done)
- Teacher-forcing control (Done)
- GST (Done)
- Add BatchNorm option to Prenet (rough sketch after the link)
https://github.com/mozilla/TTS/blob/fab74dd5be681d5fb51080515f60e1b20c6b8d40/layers/common_layers.py#L27
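Something along these lines (a sketch modelled loosely on the linked Mozilla TTS layer, not a copy of it; keeping dropout only for the non-BatchNorm variant is an assumption):

    import torch.nn as nn

    class Prenet(nn.Module):
        """Tacotron2 prenet with an optional BatchNorm variant (dropout-only by default)."""
        def __init__(self, in_dim=80, sizes=(256, 256), use_batchnorm=False, p_dropout=0.5):
            super().__init__()
            layers, prev = [], in_dim
            for size in sizes:
                layers.append(nn.Linear(prev, size))
                if use_batchnorm:
                    layers.append(nn.BatchNorm1d(size))
                layers.append(nn.ReLU())
                if not use_batchnorm:
                    layers.append(nn.Dropout(p_dropout))
                prev = size
            self.net = nn.Sequential(*layers)

        def forward(self, x):                      # x: [B, in_dim]
            return self.net(x)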
- Save the model with the best validation accuracy separately (Done)
- Automatically find the highest batch_size for the model (rough sketch after this block)
Either:
- Starting iterations should include the largest clips in the dataset
- An out-of-memory exception should cause training to restart and the batch_size to decrease
Or:
- Get the largest clip and repeat it along [B]
- batch_size should start at 1 and increase until an exception, then fall back to the last safe value.
- Needs to also consider that teacher-forcing will affect the computational graph
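A rough sketch of the second approach (repeat the worst-case clip along the batch dim and grow batch_size until OOM, then back off; run_training_step is a placeholder for whatever actually builds the graph, teacher forcing included):

    import torch

    def find_max_batch_size(run_training_step, largest_clip, start=1, limit=512):
        """Double batch_size on the worst-case clip until CUDA runs out of memory,
        then return the last size that worked."""
        best, bs = None, start
        while bs <= limit:
            try:
                # largest_clip: a single worst-case example with a leading batch dim of 1
                batch = largest_clip.repeat(bs, *([1] * (largest_clip.dim() - 1)))
                run_training_step(batch)              # must include teacher forcing etc.
                best = bs
                bs *= 2
            except RuntimeError as e:                 # CUDA OOM surfaces as a RuntimeError
                if "out of memory" not in str(e):
                    raise
                torch.cuda.empty_cache()
                break
        return best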
- Remove TensorFlow requirement
- Remove TensorboardX requirement
- Method of:
- Searching clips
- Listening to said clips
- Loading into the model
- Bundle with the model (Done)
- hparams
- best_validation loss
- best_avg_training loss
- amp state

Ideas
- https://github.com/mozilla/TTS
- Apply pre-emphasis
- MMI Auxiliary Recogniser
- SV2TTS
- Mel-spec outputs for use as WaveGlow training inputs (see 3.1.1 of https://arxiv.org/pdf/1712.05884.pdf).