Impressive work, 146T tokens with a 1.1T MoE model is seriously massive 👀🔥 That kind of scale with Mixture of Experts is a huge engineering effort.
I’m really curious how you handled expert routing and load balancing. With MoE, keeping experts efficiently utilized (and avoiding routing collapse or imbalance) is usually one of the hardest parts; there's a sketch of one common approach below.
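For context on what I mean by load balancing: a minimal sketch of the Switch-Transformer-style auxiliary loss (Fedus et al., 2021), which many MoE stacks use. This is just an illustration, not a claim about what Surya does; shapes and names are assumptions:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: penalizes mismatch between
    the fraction of tokens hard-routed to each expert (f_i) and the router's
    mean probability mass on that expert (P_i). It is minimized when routing
    is uniform, which discourages expert collapse.

    router_logits: (num_tokens, num_experts) pre-softmax router scores.
    """
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    top1 = probs.argmax(dim=-1)                           # hard top-1 assignment
    # f_i: fraction of tokens dispatched to expert i
    frac_tokens = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_probs = probs.mean(dim=0)
    # Loss ≈ 1.0 when balanced; approaches num_experts under full collapse.
    return num_experts * torch.sum(frac_tokens * mean_probs)

# Toy check with random logits (roughly balanced, so loss should be near 1.0):
logits = torch.randn(1024, 8)
print(load_balancing_loss(logits, num_experts=8))
```

Would love to know whether you used something like this, expert-choice routing, or a loss-free balancing scheme instead.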
Data curation at 146T tokens must also be a big challenge. At that scale, quality, diversity, and deduplication become just as important as quantity. Would be interesting to hear more about your pipeline, e.g. whether you used something like the MinHash approach sketched below.
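For anyone unfamiliar with why dedup is hard at this scale: exact matching misses near-duplicates, so pipelines often estimate Jaccard similarity with MinHash signatures. A self-contained toy sketch (from scratch, so no claims about any particular library or about Surya's pipeline; the 5-gram shingle size and 64 permutations are arbitrary choices for illustration):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(doc: str, num_perm: int = 64) -> tuple:
    """For each of num_perm seeded hash functions, keep the minimum
    hash over the document's shingles."""
    return tuple(
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(doc)
        )
        for seed in range(num_perm)
    )

def jaccard_estimate(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy usage: flag pairs whose estimated similarity exceeds a threshold.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",   # near-duplicate
    "completely different training document here",
]
sigs = [minhash_signature(d) for d in docs]
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = jaccard_estimate(sigs[i], sigs[j])
        if sim > 0.7:
            print(f"docs {i} and {j} look like near-duplicates (est. Jaccard {sim:.2f})")
```

At 146T tokens you'd obviously need LSH bucketing and a distributed runner on top of something like this, which is exactly why I'm curious about the details.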
The “beyond human-level reasoning” claim is bold. It would be great to see detailed benchmarks, especially on reasoning-heavy tasks, to better understand how the model performs in practice..
The collaboration call sounds exciting too; projects like this could benefit a lot from openness around methodology, evaluation, and scaling strategies. Looking forward to seeing how Surya evolves.