Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
Shrijanagain 
posted an update 2 days ago
Post
5146
Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token Pre-training
Author: Shrijan Kumar Tiwari
Affiliation: SKT AI Labs / Project Surya
Model Architecture: Optimized Dense Transformer
Parameters: 1.1 Trillion
Training Tokens: 146 Trillion

Wanna collaborate us Friends let's Start Journey we have Collected 146 trillon tokens and done pre training but we need to made more powerfull

Whitepaper - https://github.com/SHRIJANAGAIN/PROFF

holy larp

·

Haha, the results might look unbelievable, but it's all real! Feel free to test the model and see for yourself. 🚀

Impressive work, 146T tokens with a 1.1T MoE model is seriously massive 👀🔥 That kind of scale with Mixture of Experts is a huge engineering effort..

I’m really curious how you handled expert routing and load balancing. With MoE, keeping experts efficiently utilized (and avoiding collapse or imbalance) is usually one of the hardest parts..

Data curation at 146T tokens must also be a big challenge. At that scale, quality, diversity, and deduplication become just as important as quantity. Would be interesting to hear more about your pipeline..

The “beyond human-level reasoning” claim is bold. It would be great to see detailed benchmarks, especially on reasoning-heavy tasks, to better understand how the model performs in practice..

The collaboration call sounds exciting too, projects like this could benefit a lot from openness around methodology, evaluation, and scaling strategies. Looking forward to seeing how Surya evolves..

·

Appreciate the technical depth in your query, @Tanyiades ! You’ve touched on the most critical 'MoE Pain Points.' Here is how we tackled them for the 1.1T scale:

  • Expert Routing & Load Balancing: To prevent expert collapse (where only 2-3 experts do all the work), we implemented a Top-2 Gating Mechanism with an added Gaussian Noise Factor during training. This forced the router to explore all 128 experts. We also used a custom Auxiliary Balancing Loss (L_{aux}) to keep the token distribution uniform across the cluster.
  • Data Pipeline (146T): You're right, deduplication is the real hero here. We ran a multi-stage MinHash + LSH (Locality Sensitive Hashing) pipeline to remove near-duplicates. The 100T+ synthetic data wasn't just 'generated'; it was Recursive-Filtered—meaning we used a smaller 'Critic' model to score and discard low-quality reasoning chains before they hit the final training set.
  • Beyond Human Reasoning: It’s a bold claim, but we’re seeing 'Emergent Properties' in complex Hinglish code-switching tasks that dense 70B models simply can't handle. We are finalizing the GPQA (Diamond) and MATH-500 benchmarks to provide the community with empirical proof.
  • Collaboration: The PROFF repo on GitHub is just the beginning. I’d love to have someone with your expertise audit the ST-X Custom CUDA Kernels we used for the 9,200 t/s peak throughput.
    Scaling from 7B to 1.1T was a massive leap, but the architectural integrity of the MoE router made it possible. Let's connect! 🚀"

🤝🏿

·

Congratulations for collaboration with us

You'd need to prove that one mate, unless you release a paper, I don't think many people are gonna believe you

·

It's already

Did you used AI to write that paper for you? You also misspelled the word "proof" in your url, you literally can't name a Github repository and want to train an LLM from scratch 146 TRILLION parameters.

·

Typos happen when you're moving fast, but architecture is where it counts. A URL naming error doesn't change the tensor configurations or the scaling laws behind the project. While you're focusing on a missing 'o', I’m focused on the compute and data strategy required for a 146T parameter run. Stay tuned.