Link model to paper and improve model card
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,53 +1,64 @@
 ---
 license: apache-2.0
 tags:
 - moe
 - mixture-of-experts
 - causal-lm
-- olmoe
 - distributed-training
 - decentralized-training
 - sparse-sync
-language:
-- en
-pipeline_tag: text-generation
 ---

 # SPES-9B

-SPES-9B is a

-

 ## Model Details

 - **Model name:** SPES-9B
-- **Model type:** Causal language model
-- **Parameters:** 9B
-- **Framework:** SPES
 - **License:** Apache-2.0

-##

--
-- **Paper:** https://huggingface.co/papers/2602.11543

-##

-

-
-- research on MoE training and synchronization
-- experimentation and evaluation of pretrained language models


 ## Citation

-If you use this model

 ```bibtex
-@article{
 title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
-author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
 year={2026}
-}

 ---
+language:
+- en
 license: apache-2.0
+pipeline_tag: text-generation
 tags:
 - moe
 - mixture-of-experts
 - causal-lm
 - distributed-training
 - decentralized-training
 - sparse-sync
 ---

 # SPES-9B

+SPES-9B is a 9B-parameter Mixture-of-Experts (MoE) language model pretrained using **SPES** (**SP**arse **E**xpert **S**ynchronization), a memory-efficient decentralized pretraining paradigm.
+
+The model and framework were introduced in the paper: [Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm](https://huggingface.co/papers/2602.11543).

+## Authors
+Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, and Lei Zhang.

 ## Model Details

 - **Model name:** SPES-9B
+- **Model type:** Causal Mixture-of-Experts (MoE) language model
+- **Parameters:** 9B (Upcycled from a dense Qwen3-1.7B checkpoint)
+- **Framework:** [SPES](https://github.com/zjr2000/SPES)
 - **License:** Apache-2.0

+## Description

+SPES-9B was developed to address the memory constraints of GPU nodes in decentralized training environments. By training only a subset of experts per node, the SPES framework significantly reduces the memory footprint and eliminates the need for full-parameter transmission and high-speed cross-node interconnects. This allows the model to be trained effectively over standard internet connections while maintaining competitive performance compared to centralized baselines.

+## Project Links

+- **GitHub Repository:** [zjr2000/SPES](https://github.com/zjr2000/SPES)
+- **Paper:** [Hugging Face Papers](https://huggingface.co/papers/2602.11543)
+- **Training Logs:** [Weights & Biases](https://wandb.ai/zjr2000/spes/reports/SPES-9B-Train-Log--VmlldzoxNjI0MzA2Ng?accessToken=ghf43wkxavw7qnoolb9kcaeji2y8yg2dunvzowdid7jn02set7c10e1vc0t1bzi9)

+## Intended Use

+This model is intended for research on:
+- Decentralized LLM pretraining paradigms.
+- Mixture-of-Experts (MoE) training and synchronization mechanisms.
+- Evaluation of pretrained language models trained under computational and bandwidth constraints.

 ## Citation

+If you use this model or the SPES framework in your research, please cite:

 ```bibtex
+@article{zhang2026pretraining,
 title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
+author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
+journal={arXiv preprint arXiv:2602.11543},
 year={2026}
+}
+```
+
+## Acknowledgements
+
+The SPES codebase is built upon the modeling and training infrastructure provided by [OLMo (Allen Institute for AI)](https://github.com/allenai/OLMo) and utilizes [MegaBlocks (Databricks)](https://github.com/databricks/megablocks) for efficient MoE operations.