@@ -1,53 +1,64 @@
 ---
 license: apache-2.0
 tags:
 - moe
 - mixture-of-experts
 - causal-lm
-- olmoe
 - distributed-training
 - decentralized-training
 - sparse-sync
-language:
-- en
-pipeline_tag: text-generation
 ---
 # SPES-9B
-SPES-9B is a pretrained language model released as part of paper:
-**Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm**
 ## Model Details
 - **Model name:** SPES-9B
-- **Model type:** Causal language model
-- **Parameters:** 9B
-- **Framework:** SPES
 - **License:** Apache-2.0
-## Project Links
-- **GitHub:** https://github.com/zjr2000/SPES
-- **Paper:** https://huggingface.co/papers/2602.11543
-## Intended Use
-This model is intended for:
-- research on decentralized LLM pretraining
-- research on MoE training and synchronization
-- experimentation and evaluation of pretrained language models
 ## Citation
-If you use this model, please cite the SPES paper.
 ```bibtex
-@article{zhang2026spes,
   title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
-  author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
   year={2026}
-}

 ---
+language:
+- en
 license: apache-2.0
+pipeline_tag: text-generation
 tags:
 - moe
 - mixture-of-experts
 - causal-lm
 - distributed-training
 - decentralized-training
 - sparse-sync
 ---
 # SPES-9B
+SPES-9B is a 9B-parameter Mixture-of-Experts (MoE) language model pretrained using **SPES** (**SP**arse **E**xpert **S**ynchronization), a memory-efficient decentralized pretraining paradigm.
+The model and framework were introduced in the paper: [Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm](https://huggingface.co/papers/2602.11543).
+## Authors
+Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, and Lei Zhang.
 ## Model Details
 - **Model name:** SPES-9B
+- **Model type:** Causal Mixture-of-Experts (MoE) language model
+- **Parameters:** 9B (upcycled from a dense Qwen3-1.7B checkpoint)
+- **Framework:** [SPES](https://github.com/zjr2000/SPES)
 - **License:** Apache-2.0
+## Description
+SPES-9B was trained with SPES, a framework designed for the memory constraints of GPU nodes in decentralized training environments. Because each node trains only a subset of the experts, SPES reduces the per-node memory footprint and removes the need for full-parameter transmission and high-speed cross-node interconnects. This allows the model to be trained over standard internet connections while remaining competitive with centralized baselines.
+## Project Links
+- **GitHub Repository:** [zjr2000/SPES](https://github.com/zjr2000/SPES)
+- **Paper:** [Hugging Face Papers](https://huggingface.co/papers/2602.11543)
+- **Training Logs:** [Weights & Biases](https://wandb.ai/zjr2000/spes/reports/SPES-9B-Train-Log--VmlldzoxNjI0MzA2Ng?accessToken=ghf43wkxavw7qnoolb9kcaeji2y8yg2dunvzowdid7jn02set7c10e1vc0t1bzi9)
+## Intended Use
+This model is intended for research on:
+- Decentralized LLM pretraining paradigms.
+- Mixture-of-Experts (MoE) training and synchronization mechanisms.
+- Evaluation of pretrained language models trained under computational and bandwidth constraints.
 ## Citation
+If you use this model or the SPES framework in your research, please cite:
 ```bibtex
+@article{zhang2026pretraining,
   title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
+  author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
+  journal={arXiv preprint arXiv:2602.11543},
   year={2026}
+}
+```
+## Acknowledgements
+The SPES codebase is built upon the modeling and training infrastructure provided by [OLMo (Allen Institute for AI)](https://github.com/allenai/OLMo) and utilizes [MegaBlocks (Databricks)](https://github.com/databricks/megablocks) for efficient MoE operations.
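
The Description section says each node trains only a subset of experts and that full-parameter transmission is avoided. The toy sketch below illustrates that general idea only; it is not the SPES implementation. The round-robin assignment, the averaging rule, and the names `assign_experts` and `sparse_sync` are all assumptions made for illustration, and each expert is reduced to a single float standing in for its weight tensor.

```python
from typing import Dict, List


def assign_experts(num_experts: int, num_nodes: int, experts_per_node: int) -> Dict[int, List[int]]:
    """Hypothetical round-robin assignment (an assumption, not taken from SPES):
    node n trains `experts_per_node` consecutive expert ids, wrapping around."""
    return {
        node: [(node * experts_per_node + k) % num_experts for k in range(experts_per_node)]
        for node in range(num_nodes)
    }


def sparse_sync(local_experts: Dict[int, Dict[int, float]]) -> Dict[int, float]:
    """Average each expert over only the nodes that actually trained it.

    `local_experts` maps node id -> {expert id -> locally updated weight}.
    Each node contributes only the experts it owns, so nothing else is sent.
    Plain averaging is an illustrative choice, not the documented SPES rule.
    """
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for experts in local_experts.values():
        for expert_id, weight in experts.items():
            sums[expert_id] = sums.get(expert_id, 0.0) + weight
            counts[expert_id] = counts.get(expert_id, 0) + 1
    return {expert_id: sums[expert_id] / counts[expert_id] for expert_id in sums}


if __name__ == "__main__":
    # 8 experts spread over 4 nodes, 2 experts per node.
    owners = assign_experts(num_experts=8, num_nodes=4, experts_per_node=2)
    # Pretend every node nudged each of its experts by a node-specific amount.
    local = {node: {e: 1.0 + 0.1 * node for e in ids} for node, ids in owners.items()}
    print(sparse_sync(local))
```

In this toy setup each node sends 2 of the 8 expert entries per synchronization round, which is the sense in which the payload stays below a full-parameter exchange.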