- 7B LLM pre-training baseline cost: $200,000 - MPT-7B, 2048 ctx length, 1T tokens, matches LLaMA-7B performance (the cumulative cost chain is tallied in the first sketch after this list)
- https://www.databricks.com/blog/mpt-7b
- Use 4090s instead of A100s: GaLore, plus either fp8 with fp32 accumulation (unit scaling paper) or fp16 with fp16 accumulation (pairwise accumulation to keep rounding error down), gives ~330 TFLOPS -> 1/5 the cost on vast.ai versus the MPT baseline of $2 per GPU-hour (calculated from their publicly released data) -> $40,000 (GaLore sketch after this list)
- https://arxiv.org/abs/2403.03507v1
- https://arxiv.org/abs/2303.11257
- https://discord.com/channels/1068976834382925865/1219897318384472064/1220080285190979705
- Use fp8 with fp16 accumulation, combining unit scaling with pairwise accumulation (potentially impossible) -> 2x boost to ~660 TFLOPS -> $20,000 (pairwise-accumulation demo after this list)
- ** same references as above **
- FlashAttention-2 -> ~2x speedup over the original FlashAttention -> $10,000 (usage sketch after this list)
- https://tridao.me/publications/flash2/flash2.pdf
- Reduce context length to 128 (plus sequence packing and larger batch sizes), then post-hoc scale back up to 8k+ with GrowLength-style training or entropy-aware ABF -> 2x+ reduction? -> $5,000 (packing sketch after this list)
- https://arxiv.org/abs/2310.00576
- https://arxiv.org/abs/2401.07004
- Train on a single node -> ~2x boost from reduced communication overhead -> $2,500
- https://arxiv.org/pdf/2311.00257v2.pdf (just look at the drop in MFU from multi-node training)
- Reduce the dataset from 5 TB to ~1 TB while maintaining performance via a MiniCPM-style training recipe -> 5x reduction -> $500 (WSD schedule sketch after this list)
- https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
- https://arxiv.org/pdf/2203.03466.pdf
- LAWA (early weight averaging with high learning rates) -> ~20% faster convergence -> $400 (checkpoint-averaging sketch after this list)
- https://arxiv.org/pdf/2306.03241.pdf
- Google Brain's "Transcending Scaling Laws with 0.1% Extra Compute" (UL2R continued pre-training) -> 2x savings -> $200 (span-corruption sketch after this list)
- https://arxiv.org/abs/2210.11399
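
Worked cost chain for the bullets above; each step applies only the multiplier claimed in its bullet, so this is just the arithmetic, not an independent estimate.

```python
# Cumulative cost chain, using the multipliers claimed in each bullet above.
cost = 200_000                       # MPT-7B-style baseline
cost /= 5    # 4090s on vast.ai + GaLore + low-precision matmuls -> 40_000
cost /= 2    # fp8 with fp16 accumulation                        -> 20_000
cost /= 2    # FlashAttention-2                                  -> 10_000
cost /= 2    # 128-token context + sequence packing              ->  5_000
cost /= 2    # single-node training                              ->  2_500
cost /= 5    # MiniCPM-style data recipe (5 TB -> ~1 TB)         ->    500
cost *= 0.8  # LAWA, ~20% faster convergence                     ->    400
cost /= 2    # UL2R-style continued pre-training                 ->    200
print(f"${cost:,.0f}")  # $200
```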
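
Minimal sketch of swapping in the GaLore optimizer for the 4090 item. The package name and keyword arguments below follow the galore-torch README; the rank/scale values and the name-based parameter split are illustrative assumptions, not tuned settings.

```python
import torch
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = torch.nn.Transformer(d_model=512, nhead=8)  # stand-in for the 7B model

# GaLore projects gradients of the large 2-D weight matrices into a low-rank
# subspace, shrinking optimizer state so a 7B model can train on a 24 GB 4090.
galore_params = [p for n, p in model.named_parameters() if p.dim() == 2 and "linear" in n]
other_params  = [p for n, p in model.named_parameters() if not (p.dim() == 2 and "linear" in n)]

optimizer = GaLoreAdamW(
    [
        {"params": other_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ],
    lr=1e-3,
)
```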
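
The pairwise-accumulation idea behind the low-precision-accumulate items: summing in a balanced tree keeps fp16 rounding error growing roughly with log n instead of n. A self-contained numpy demo (the real thing has to live inside the matmul kernel):

```python
import numpy as np

def pairwise_sum_fp16(x):
    # Balanced-tree ("pairwise") summation carried out entirely in fp16.
    if x.size <= 2:
        return np.float16(x.sum(dtype=np.float16))
    mid = x.size // 2
    return np.float16(pairwise_sum_fp16(x[:mid]) + pairwise_sum_fp16(x[mid:]))

x = np.random.randn(1 << 16).astype(np.float16)
exact = x.astype(np.float64).sum()

seq = np.float16(0.0)                # naive left-to-right fp16 accumulation
for v in x:
    seq = np.float16(seq + v)

print("sequential fp16 error:", abs(float(seq) - exact))
print("pairwise   fp16 error:", abs(float(pairwise_sum_fp16(x)) - exact))
```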
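
FlashAttention-2 as a drop-in attention call. The function name and tensor layout are from the flash-attn package; verify against the installed version.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# (batch, seq_len, n_heads, head_dim), fp16/bf16, on GPU
q = torch.randn(4, 2048, 32, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```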
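
Sequence packing for the 128-token-context item: concatenate tokenized documents so every position carries a real token instead of padding. A greedy packer sketch; a production version would also emit block-diagonal attention masks and reset position ids at document boundaries.

```python
def pack_sequences(docs, seq_len=128, eos_id=2):
    """Greedily pack tokenized documents into fixed-length rows of seq_len tokens."""
    rows, buf = [], []
    for doc in docs:
        buf.extend(doc + [eos_id])      # separate documents with an EOS token
        while len(buf) >= seq_len:
            rows.append(buf[:seq_len])
            buf = buf[seq_len:]
    return rows                          # leftover tokens in buf are dropped

batch = pack_sequences([[5, 6, 7]] * 100, seq_len=128)
print(len(batch), len(batch[0]))         # 3 128  (400 tokens -> 3 full rows)
```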
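
A warmup-stable-decay learning-rate schedule in the spirit of the MiniCPM recipe (their report also mixes higher-quality data into the decay phase). The cosine decay tail and the default step counts here are illustrative assumptions.

```python
import math

def wsd_lr(step, max_lr=1e-2, warmup=2_000, stable=80_000, decay=8_000, min_lr=1e-4):
    """Warmup-Stable-Decay: linear warmup, long flat plateau, short final decay."""
    if step < warmup:
        return max_lr * step / warmup
    if step < warmup + stable:
        return max_lr
    t = min((step - warmup - stable) / decay, 1.0)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * t))
```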
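
LAWA-style checkpoint averaging: keep training with a high learning rate, but evaluate (or continue from) the average of the latest few checkpoints. Assumes checkpoints are saved as plain state_dicts.

```python
import torch

def average_checkpoints(paths):
    """Average the weights of the latest k checkpoints into one state_dict."""
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# model.load_state_dict(average_checkpoints(["ckpt_09800.pt", "ckpt_09900.pt", "ckpt_10000.pt"]))
```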
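
The last item is UL2R: a short continued-pre-training run with UL2's mixture-of-denoisers objective on top of a causal-LM checkpoint. Below is a stripped-down span-corruption (R-denoiser) sketch; the span sampling and sentinel handling are simplified assumptions, not the paper's exact recipe.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span=3, sentinel_start=32_000):
    """Replace random spans with sentinel ids; the target reproduces the spans."""
    inputs, targets = [], []
    i, sentinel, corrupted = 0, sentinel_start, 0
    budget = max(1, int(len(tokens) * corruption_rate))
    while i < len(tokens):
        if corrupted < budget and random.random() < corruption_rate:
            span = min(max(1, round(random.expovariate(1 / mean_span))), len(tokens) - i)
            inputs.append(sentinel)
            targets.extend([sentinel] + tokens[i:i + span])
            sentinel += 1
            corrupted += span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt(list(range(100)))
```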