bringing down the cost of pretraining llama to $200

baseline 7B LLM pre-training cost: $200,000 - MPT-7B, 2048 ctx length, 1T tokens - matches LLaMA-7B performance (cost-per-GPU-hour arithmetic sketched below)
https://www.databricks.com/blog/mpt-7b
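
As a sanity check on that baseline: the MPT-7B blog reports roughly 440 A100s for ~9.5 days (figures quoted from memory, treat as approximate), which recovers the $2/GPU-hour rate used in the next item:

```python
# Sanity check on the MPT-7B baseline. GPU count and wall-clock time are
# assumptions quoted from memory of the MosaicML blog; treat as approximate.
gpus = 440            # A100s
days = 9.5
total_cost = 200_000  # USD, as stated above

gpu_hours = gpus * 24 * days
print(f"{gpu_hours:,.0f} GPU-hours -> ${total_cost / gpu_hours:.2f}/GPU-hour")
# ~100,320 GPU-hours -> ~$2/GPU-hour
```
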
Use 4090s instead of A100s: GaLore, plus either fp8 matmuls with fp32 accumulation (unit scaling paper) or fp16 with fp16 accumulation (pairwise accumulation to keep rounding error down), gives ~330 tflops -> 1/5 the cost on vast.ai versus the MPT baseline of $2 per GPU-hour (calculated from their publicly released data) -> $40,000. A GaLore setup is sketched below.
https://arxiv.org/abs/2403.03507v1
https://arxiv.org/abs/2303.11257
https://discord.com/channels/1068976834382925865/1219897318384472064/1220080285190979705
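
A minimal sketch of wiring up GaLore via the released galore-torch package; the tiny stand-in model and the rank/update_proj_gap/scale values are illustrative, not tuned settings:

```python
import torch.nn as nn
from galore_torch import GaLoreAdamW  # pip install galore-torch

# Tiny stand-in for the 7B transformer.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# GaLore projects gradients of the large 2D weight matrices into a low-rank
# subspace (cutting optimizer-state memory); 1D params get plain AdamW updates.
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW(
    [
        {"params": other_params},
        {"params": galore_params,
         "rank": 128,             # low-rank projection dimension
         "update_proj_gap": 200,  # refresh the projection every 200 steps
         "scale": 0.25,           # extra scale on the projected update
         "proj_type": "std"},
    ],
    lr=1e-3,
)
```
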
Use fp8 with fp16 accumulation, combining unit scaling and pairwise accumulation (potentially impossible) -> 2x boost to ~660 tflops -> $20,000. The pairwise-accumulation trick is demoed in isolation below.
(same references as above)
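
The pairwise-accumulation idea is easy to demo in isolation (in a real kernel it would live inside the matmul accumulator, not in Python); a toy fp16 comparison:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1 << 14, dtype=torch.float16)
exact = x.double().sum()

# Naive left-to-right fp16 accumulation: rounding error grows ~O(n).
naive = torch.zeros((), dtype=torch.float16)
for v in x:
    naive = naive + v

# Pairwise (tree) accumulation: error grows ~O(log n) instead.
def pairwise_sum(t: torch.Tensor) -> torch.Tensor:
    while t.numel() > 1:
        if t.numel() % 2:
            t = torch.cat([t, t.new_zeros(1)])  # pad odd lengths
        t = t[0::2] + t[1::2]                   # sum adjacent pairs
    return t[0]

print("naive error:   ", abs(naive.double() - exact).item())
print("pairwise error:", abs(pairwise_sum(x).double() - exact).item())
```
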
FlashAttention-2 -> ~2x speedup over the original FlashAttention -> $10,000 (minimal usage sketched below)
https://tridao.me/publications/flash2/flash2.pdf
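
A minimal call into the released flash-attn package (CUDA only, fp16/bf16 inputs, layout (batch, seqlen, heads, headdim)); this is standard usage, nothing bespoke:

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn

batch, seqlen, heads, headdim = 2, 2048, 32, 128
q = torch.randn(batch, seqlen, heads, headdim, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused causal attention: the (seqlen x seqlen) score matrix is never
# materialized in HBM, which is where the speed/memory win comes from.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 2048, 32, 128])
```
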
Reduce context length to 128 (plus sequence packing and larger batch sizes), then scale back up to 8k+ post hoc with GrowLength-style training or entropy-ABF -> 2x+ reduction? -> $5,000. A packing sketch follows the references.
https://arxiv.org/abs/2310.00576
https://arxiv.org/abs/2401.07004
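
Sequence packing at ctx 128 is simple to sketch: concatenate tokenized documents into one stream and slice fixed-length chunks so no compute is spent on padding (document-boundary attention masks omitted; the EOS id is a placeholder):

```python
from itertools import chain

SEQ_LEN = 128
EOS = 0  # placeholder end-of-document token id

def pack(docs):
    """Greedily pack tokenized docs into fixed-length training sequences."""
    stream = chain.from_iterable(doc + [EOS] for doc in docs)
    out, buf = [], []
    for tok in stream:
        buf.append(tok)
        if len(buf) == SEQ_LEN:
            out.append(buf)
            buf = []
    return out  # a trailing partial buffer would carry over to the next pass

docs = [[5, 6, 7], [8, 9], list(range(10, 400))]
seqs = pack(docs)
print(len(seqs), "sequences of", SEQ_LEN, "tokens, zero padding")
```
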
Train on a single node -> ~2x boost from reduced communication overhead -> $2,500; a quick MFU estimate is sketched below.
https://arxiv.org/pdf/2311.00257v2.pdf (just look at the drop in MFU from multi-node training)
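
MFU is the number the cited report tracks; with the standard ~6 * params * tokens approximation for dense-transformer training FLOPs it is a one-liner to estimate from throughput logs (the inputs below are illustrative, not measurements):

```python
# Model FLOPs Utilization via the standard 6 * N * D approximation.
def mfu(params: float, tokens_per_sec_per_gpu: float, peak_flops: float) -> float:
    return 6 * params * tokens_per_sec_per_gpu / peak_flops

# Illustrative: a 7B model at 3,000 tok/s/GPU against the ~330 tflops
# effective peak assumed earlier in this list.
print(f"{mfu(7e9, 3_000, 330e12):.1%}")  # ~38% MFU
```
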
Reduce dataset size from 5TB to ~1TB while maintaining performance with a MiniCPM-style training recipe (WSD schedule, muP-style hyperparameter transfer) -> 5x reduction -> $500. The WSD schedule is sketched below.
https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
https://arxiv.org/pdf/2203.03466.pdf
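
One concrete, reproducible piece of the MiniCPM recipe is the WSD (warmup-stable-decay) learning-rate schedule; a sketch with the shape they describe (the exact fractions and the exponential decay form here are illustrative):

```python
def wsd_lr(step, total, peak_lr, warmup_frac=0.01, decay_frac=0.1, min_ratio=0.1):
    """Warmup-Stable-Decay: linear warmup, long flat plateau at peak_lr,
    then a rapid decay over the final ~10% of steps."""
    warmup_steps = max(1, int(warmup_frac * total))
    decay_start = int((1 - decay_frac) * total)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(1, total - decay_start)
    return peak_lr * min_ratio ** frac  # decays to min_ratio * peak_lr

for s in (0, 1_000, 50_000, 95_000, 100_000):
    print(s, f"{wsd_lr(s, 100_000, 1e-3):.2e}")
```
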
LAWA (latest weight averaging) with high learning rates early in training -> ~20% faster convergence -> $400; minimal sketch below.
https://arxiv.org/pdf/2306.03241.pdf
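
A minimal sketch of the LAWA idea: train with an aggressive LR, keep a rolling window of the last k checkpoints, and evaluate (or ship) their uniform average; k and the sampling interval here are illustrative, the paper sweeps them:

```python
from collections import deque
import torch

class LAWA:
    """Rolling average of the last k checkpoints, sampled every `interval` steps."""
    def __init__(self, k: int = 6, interval: int = 1000):
        self.window = deque(maxlen=k)
        self.interval = interval

    def maybe_collect(self, step: int, model: torch.nn.Module) -> None:
        if step > 0 and step % self.interval == 0:
            self.window.append({name: t.detach().float().cpu().clone()
                                for name, t in model.state_dict().items()})

    def averaged(self) -> dict:
        # Uniform average over the window; load into a copy of the model for eval
        # while the high-LR run keeps training.
        return {name: torch.stack([ckpt[name] for ckpt in self.window]).mean(0)
                for name in self.window[0]}
```
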
Google Brain's "Transcending Scaling Laws with 0.1% Extra Compute" (UL2R) -> 2x savings -> $200. A toy mixture-of-denoisers sketch follows the reference.
https://arxiv.org/abs/2210.11399
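
UL2R is continued pre-training with UL2's mixture-of-denoisers for a tiny extra compute budget; a toy sketch of the three denoiser families on word-level tokens (the rates/span lengths only approximate the UL2 setup, and the T5-style sentinel naming is an assumption):

```python
import random

# (name, corruption_rate, mean_span); S is the sequential / prefix-LM denoiser.
DENOISERS = [("R", 0.15, 3), ("X", 0.50, 32), ("S", 0.25, None)]

def make_example(tokens):
    name, rate, mean_span = random.choice(DENOISERS)
    if name == "S":  # predict the suffix from the prefix
        cut = int(len(tokens) * (1 - rate))
        return tokens[:cut], tokens[cut:]
    inp, tgt, i, sid = [], [], 0, 0
    while i < len(tokens):
        # Start a span with probability rate/mean_span, so the expected
        # fraction of corrupted tokens is ~rate.
        if random.random() < rate / mean_span:
            span = max(1, round(random.expovariate(1 / mean_span)))
            sentinel = f"<extra_id_{sid}>"; sid += 1
            inp.append(sentinel)
            tgt += [sentinel] + tokens[i:i + span]
            i += span
        else:
            inp.append(tokens[i]); i += 1
    return inp, tgt

toks = "the quick brown fox jumps over the lazy dog".split()
print(make_example(toks))
```
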