@Shrijanagain on Hugging Face: "Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token…"

Join the conversation

Join the community of Machine Learners and AI enthusiasts.

posted an update 2 days ago

Post

5146

Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token Pre-training
Author: Shrijan Kumar Tiwari
Affiliation: SKT AI Labs / Project Surya
Model Architecture: Optimized Dense Transformer
Parameters: 1.1 Trillion
Training Tokens: 146 Trillion

Wanna collaborate us Friends let's Start Journey we have Collected 146 trillon tokens and done pre training but we need to made more powerfull

Whitepaper - https://github.com/SHRIJANAGAIN/PROFF

Monenyo

2 days ago

holy larp

Shrijanagain

2 days ago

Haha, the results might look unbelievable, but it's all real! Feel free to test the model and see for yourself. 🚀

tanyiades

2 days ago

Impressive work, 146T tokens with a 1.1T MoE model is seriously massive 👀🔥 That kind of scale with Mixture of Experts is a huge engineering effort..

I’m really curious how you handled expert routing and load balancing. With MoE, keeping experts efficiently utilized (and avoiding collapse or imbalance) is usually one of the hardest parts..

Data curation at 146T tokens must also be a big challenge. At that scale, quality, diversity, and deduplication become just as important as quantity. Would be interesting to hear more about your pipeline..

The “beyond human-level reasoning” claim is bold. It would be great to see detailed benchmarks, especially on reasoning-heavy tasks, to better understand how the model performs in practice..

The collaboration call sounds exciting too, projects like this could benefit a lot from openness around methodology, evaluation, and scaling strategies. Looking forward to seeing how Surya evolves..

Shrijanagain

1 day ago

Appreciate the technical depth in your query, @Tanyiades ! You’ve touched on the most critical 'MoE Pain Points.' Here is how we tackled them for the 1.1T scale:

Expert Routing & Load Balancing: To prevent expert collapse (where only 2-3 experts do all the work), we implemented a Top-2 Gating Mechanism with an added Gaussian Noise Factor during training. This forced the router to explore all 128 experts. We also used a custom Auxiliary Balancing Loss (L_{aux}) to keep the token distribution uniform across the cluster.
Data Pipeline (146T): You're right, deduplication is the real hero here. We ran a multi-stage MinHash + LSH (Locality Sensitive Hashing) pipeline to remove near-duplicates. The 100T+ synthetic data wasn't just 'generated'; it was Recursive-Filtered—meaning we used a smaller 'Critic' model to score and discard low-quality reasoning chains before they hit the final training set.
Beyond Human Reasoning: It’s a bold claim, but we’re seeing 'Emergent Properties' in complex Hinglish code-switching tasks that dense 70B models simply can't handle. We are finalizing the GPQA (Diamond) and MATH-500 benchmarks to provide the community with empirical proof.
Collaboration: The PROFF repo on GitHub is just the beginning. I’d love to have someone with your expertise audit the ST-X Custom CUDA Kernels we used for the 9,200 t/s peak throughput.
Scaling from 7B to 1.1T was a massive leap, but the architectural integrity of the MoE router made it possible. Let's connect! 🚀"

CYGDEN

1 day ago

🤝🏿

Shrijanagain

1 day ago

Congratulations for collaboration with us

daniel7789

1 day ago

You'd need to prove that one mate, unless you release a paper, I don't think many people are gonna believe you

Shrijanagain

1 day ago

It's already

PatoFlamejanteTV

about 16 hours ago

Did you used AI to write that paper for you? You also misspelled the word "proof" in your url, you literally can't name a Github repository and want to train an LLM from scratch 146 TRILLION parameters.

Shrijanagain

about 13 hours ago

Typos happen when you're moving fast, but architecture is where it counts. A URL naming error doesn't change the tensor configurations or the scaling laws behind the project. While you're focusing on a missing 'o', I’m focused on the compute and data strategy required for a 146T parameter run. Stay tuned.

In this post