After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥
Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running far below peak efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️
Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
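On the checkpointing question, one classic starting point (my own framing, not something the playbook prescribes) is the Young/Daly approximation, which balances the cost of writing a checkpoint against how often the cluster fails:

```python
import math

def optimal_checkpoint_interval(checkpoint_seconds: float, mtbf_seconds: float) -> float:
    """Young/Daly rule of thumb: interval = sqrt(2 * C * MTBF).

    checkpoint_seconds: wall-clock cost of writing one checkpoint (C).
    mtbf_seconds: mean time between failures for the whole cluster.
    """
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)

# Hypothetical numbers: a 60 s checkpoint on a cluster that fails about once a day.
interval = optimal_checkpoint_interval(60, 24 * 3600)
print(f"checkpoint every ~{interval / 60:.0f} min")  # → checkpoint every ~54 min
```

Checkpointing much more often than this burns throughput on I/O; much less often, and you lose too much work per failure.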
That's why we built The Smol Training Playbook: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and, crucially, the infrastructure layer that most teams get wrong.
We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting roughly 3 TB/s, plus NVLink 4.0 and PCIe Gen5. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100s each) and measured how performance degrades at scale: all-reduce bandwidth drops sharply going from a single node to 16 nodes.
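For context on how raw all-reduce timings become bus-bandwidth figures: the standard accounting (the busbw convention used by nccl-tests) is that a ring all-reduce moves the buffer 2(n-1)/n times per rank. A minimal sketch with made-up timings:

```python
def allreduce_bus_bandwidth(bytes_per_rank: int, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce, per the nccl-tests convention:
    busbw = (size / time) * 2 * (n - 1) / n
    """
    algbw = bytes_per_rank / seconds  # "algorithm" bandwidth: payload over wall time
    return algbw * 2 * (n_ranks - 1) / n_ranks / 1e9

# Hypothetical measurement: a 1 GiB buffer reduced across 8 GPUs in 6 ms.
print(f"{allreduce_bus_bandwidth(1 << 30, 6e-3, 8):.0f} GB/s")  # → 313 GB/s
```

This is why per-node numbers don't translate directly across nodes: once the ring crosses the inter-node fabric, the slowest link sets the pace for every rank.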
If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.
The 3C3H AraGen Leaderboard today welcomes deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5) to the ranking of the best LLMs in Arabic!
Observations:
- DeepSeek-V3 ranks 3rd and is the only open model among the top 5!
- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we've come in advancing and supporting Arabic presence within the LLM ecosystem!
- Contrary to what is observed on likelihood-accuracy leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further fine-tuned models like maldv/Qwentile2.5-32B-Instruct actually decreased performance compared to the original model Qwen/Qwen2.5-32B-Instruct. It's worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain fine-tuning doesn't really hurt the capabilities the model acquired during pretraining. Previous work has addressed this (fine-tuning vs. pretraining), but more investigation is required (any PhDs here? This could be your research question...)