- 7B LLM pre-training baseline cost: $200,000 - MPT-7B, 2048 ctx length, 1T tokens, matches LLaMA-7B performance (the cumulative cost chain is tallied in the first sketch after this list)
- https://www.databricks.com/blog/mpt-7b
- Use 4090s instead of A100s: GaLore, plus either fp8 with fp32 accumulation (unit scaling paper) or fp16 with fp16 accumulation (pairwise accumulation to keep rounding error down), gives ~330 TFLOPS -> 1/5 the cost on vast.ai versus the MPT baseline of $2 per GPU-hour (calculated from their publicly released data) -> $40,000 (GaLore sketch after this list)
- https://arxiv.org/abs/2403.03507v1
- https://arxiv.org/abs/2303.11257
- https://discord.com/channels/1068976834382925865/1219897318384472064/1220080285190979705
- Use fp8 with fp16 accumulation, combining unit scaling with pairwise accumulation (potentially impossible) -> 2x boost to ~660 TFLOPS -> $20,000 (pairwise-accumulation demo after this list)
- ** same references as above **
- FlashAttention-2 -> ~2x speedup over the original FlashAttention -> $10,000 (usage sketch after this list)
- https://tridao.me/publications/flash2/flash2.pdf
- Reduce context length to 128 (plus sequence packing and larger batch sizes), then post-hoc scale back up to 8k+ with GrowLength-style training or entropy-aware ABF -> 2x+ reduction? -> $5,000 (packing sketch after this list)
- https://arxiv.org/abs/2310.00576
- https://arxiv.org/abs/2401.07004
- Train on a single node -> ~2x boost from reduced communication overhead -> $2,500
- https://arxiv.org/pdf/2311.00257v2.pdf (just look at the drop in MFU from multi-node training)
- Reduce the dataset from 5 TB to ~1 TB while maintaining performance via a MiniCPM-style training recipe -> 5x reduction -> $500 (WSD schedule sketch after this list)
- https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
- https://arxiv.org/pdf/2203.03466.pdf
- LAWA (early weight averaging with high learning rates) -> ~20% faster convergence -> $400 (checkpoint-averaging sketch after this list)
- https://arxiv.org/pdf/2306.03241.pdf
- Google Brain's "Transcending Scaling Laws with 0.1% Extra Compute" (UL2R continued pre-training) -> 2x savings -> $200 (span-corruption sketch after this list)
- https://arxiv.org/abs/2210.11399
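
Worked cost chain for the bullets above; each step applies only the multiplier claimed in its bullet, so this is just the arithmetic, not an independent estimate.

```python
# Cumulative cost chain, using the multipliers claimed in each bullet above.
cost = 200_000                       # MPT-7B-style baseline
cost /= 5    # 4090s on vast.ai + GaLore + low-precision matmuls -> 40_000
cost /= 2    # fp8 with fp16 accumulation                        -> 20_000
cost /= 2    # FlashAttention-2                                  -> 10_000
cost /= 2    # 128-token context + sequence packing              ->  5_000
cost /= 2    # single-node training                              ->  2_500
cost /= 5    # MiniCPM-style data recipe (5 TB -> ~1 TB)         ->    500
cost *= 0.8  # LAWA, ~20% faster convergence                     ->    400
cost /= 2    # UL2R-style continued pre-training                 ->    200
print(f"${cost:,.0f}")  # $200
```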
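
Minimal sketch of swapping in the GaLore optimizer for the 4090 item. The package name and keyword arguments below follow the galore-torch README; the rank/scale values and the name-based parameter split are illustrative assumptions, not tuned settings.

```python
import torch
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = torch.nn.Transformer(d_model=512, nhead=8)  # stand-in for the 7B model

# GaLore projects gradients of the large 2-D weight matrices into a low-rank
# subspace, shrinking optimizer state so a 7B model can train on a 24 GB 4090.
galore_params = [p for n, p in model.named_parameters() if p.dim() == 2 and "linear" in n]
other_params  = [p for n, p in model.named_parameters() if not (p.dim() == 2 and "linear" in n)]

optimizer = GaLoreAdamW(
    [
        {"params": other_params},
        {"params": galore_params, "rank": 128, "update_proj_gap": 200,
         "scale": 0.25, "proj_type": "std"},
    ],
    lr=1e-3,
)
```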
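
The pairwise-accumulation idea behind the low-precision-accumulate items: summing in a balanced tree keeps fp16 rounding error growing roughly with log n instead of n. A self-contained numpy demo (the real thing has to live inside the matmul kernel):

```python
import numpy as np

def pairwise_sum_fp16(x):
    # Balanced-tree ("pairwise") summation carried out entirely in fp16.
    if x.size <= 2:
        return np.float16(x.sum(dtype=np.float16))
    mid = x.size // 2
    return np.float16(pairwise_sum_fp16(x[:mid]) + pairwise_sum_fp16(x[mid:]))

x = np.random.randn(1 << 16).astype(np.float16)
exact = x.astype(np.float64).sum()

seq = np.float16(0.0)                # naive left-to-right fp16 accumulation
for v in x:
    seq = np.float16(seq + v)

print("sequential fp16 error:", abs(float(seq) - exact))
print("pairwise   fp16 error:", abs(float(pairwise_sum_fp16(x)) - exact))
```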
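
FlashAttention-2 as a drop-in attention call. The function name and tensor layout are from the flash-attn package; verify against the installed version.

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

# (batch, seq_len, n_heads, head_dim), fp16/bf16, on GPU
q = torch.randn(4, 2048, 32, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```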
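
Sequence packing for the 128-token-context item: concatenate tokenized documents so every position carries a real token instead of padding. A greedy packer sketch; a production version would also emit block-diagonal attention masks and reset position ids at document boundaries.

```python
def pack_sequences(docs, seq_len=128, eos_id=2):
    """Greedily pack tokenized documents into fixed-length rows of seq_len tokens."""
    rows, buf = [], []
    for doc in docs:
        buf.extend(doc + [eos_id])      # separate documents with an EOS token
        while len(buf) >= seq_len:
            rows.append(buf[:seq_len])
            buf = buf[seq_len:]
    return rows                          # leftover tokens in buf are dropped

batch = pack_sequences([[5, 6, 7]] * 100, seq_len=128)
print(len(batch), len(batch[0]))         # 3 128  (400 tokens -> 3 full rows)
```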
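
A warmup-stable-decay learning-rate schedule in the spirit of the MiniCPM recipe (their report also mixes higher-quality data into the decay phase). The cosine decay tail and the default step counts here are illustrative assumptions.

```python
import math

def wsd_lr(step, max_lr=1e-2, warmup=2_000, stable=80_000, decay=8_000, min_lr=1e-4):
    """Warmup-Stable-Decay: linear warmup, long flat plateau, short final decay."""
    if step < warmup:
        return max_lr * step / warmup
    if step < warmup + stable:
        return max_lr
    t = min((step - warmup - stable) / decay, 1.0)
    return min_lr + (max_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * t))
```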
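
LAWA-style checkpoint averaging: keep training with a high learning rate, but evaluate (or continue from) the average of the latest few checkpoints. Assumes checkpoints are saved as plain state_dicts.

```python
import torch

def average_checkpoints(paths):
    """Average the weights of the latest k checkpoints into one state_dict."""
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# model.load_state_dict(average_checkpoints(["ckpt_09800.pt", "ckpt_09900.pt", "ckpt_10000.pt"]))
```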
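
The last item is UL2R: a short continued-pre-training run with UL2's mixture-of-denoisers objective on top of a causal-LM checkpoint. Below is a stripped-down span-corruption (R-denoiser) sketch; the span sampling and sentinel handling are simplified assumptions, not the paper's exact recipe.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span=3, sentinel_start=32_000):
    """Replace random spans with sentinel ids; the target reproduces the spans."""
    inputs, targets = [], []
    i, sentinel, corrupted = 0, sentinel_start, 0
    budget = max(1, int(len(tokens) * corruption_rate))
    while i < len(tokens):
        if corrupted < budget and random.random() < corruption_rate:
            span = min(max(1, round(random.expovariate(1 / mean_span))), len(tokens) - i)
            inputs.append(sentinel)
            targets.extend([sentinel] + tokens[i:i + span])
            sentinel += 1
            corrupted += span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt(list(range(100)))
```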