Link model to paper and improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +32 -21
README.md CHANGED
@@ -1,53 +1,64 @@
  ---
  license: apache-2.0
  tags:
  - moe
  - mixture-of-experts
  - causal-lm
- - olmoe
  - distributed-training
  - decentralized-training
  - sparse-sync
- language:
- - en
- pipeline_tag: text-generation
  ---

  # SPES-9B

- SPES-9B is a pretrained language model released as part of paper:

- **Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm**

  ## Model Details

  - **Model name:** SPES-9B
- - **Model type:** Causal language model
- - **Parameters:** 9B
- - **Framework:** SPES
  - **License:** Apache-2.0

- ## Project Links

- - **GitHub:** https://github.com/zjr2000/SPES
- - **Paper:** https://huggingface.co/papers/2602.11543

- ## Intended Use

- This model is intended for:

- - research on decentralized LLM pretraining
- - research on MoE training and synchronization
- - experimentation and evaluation of pretrained language models


  ## Citation

- If you use this model, please cite the SPES paper.

  ```bibtex
- @article{zhang2026spes,
  title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
- author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
  year={2026}
- }

  ---
+ language:
+ - en
  license: apache-2.0
+ pipeline_tag: text-generation
  tags:
  - moe
  - mixture-of-experts
  - causal-lm
  - distributed-training
  - decentralized-training
  - sparse-sync
  ---

  # SPES-9B

+ SPES-9B is a 9B-parameter Mixture-of-Experts (MoE) language model pretrained using **SPES** (**SP**arse **E**xpert **S**ynchronization), a memory-efficient decentralized pretraining paradigm.
+
+ The model and framework were introduced in the paper: [Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm](https://huggingface.co/papers/2602.11543).

+ ## Authors
+ Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, and Lei Zhang.

  ## Model Details

  - **Model name:** SPES-9B
+ - **Model type:** Causal Mixture-of-Experts (MoE) language model
+ - **Parameters:** 9B (Upcycled from a dense Qwen3-1.7B checkpoint)
+ - **Framework:** [SPES](https://github.com/zjr2000/SPES)
  - **License:** Apache-2.0

+ ## Description

+ SPES-9B was developed to address the memory constraints of GPU nodes in decentralized training environments. By training only a subset of experts per node, the SPES framework significantly reduces the memory footprint and eliminates the need for full-parameter transmission and high-speed cross-node interconnects. This allows the model to be trained effectively over standard internet connections while maintaining competitive performance compared to centralized baselines.

+ ## Project Links

+ - **GitHub Repository:** [zjr2000/SPES](https://github.com/zjr2000/SPES)
+ - **Paper:** [Hugging Face Papers](https://huggingface.co/papers/2602.11543)
+ - **Training Logs:** [Weights & Biases](https://wandb.ai/zjr2000/spes/reports/SPES-9B-Train-Log--VmlldzoxNjI0MzA2Ng?accessToken=ghf43wkxavw7qnoolb9kcaeji2y8yg2dunvzowdid7jn02set7c10e1vc0t1bzi9)

+ ## Intended Use

+ This model is intended for research on:
+ - Decentralized LLM pretraining paradigms.
+ - Mixture-of-Experts (MoE) training and synchronization mechanisms.
+ - Evaluation of pretrained language models trained under computational and bandwidth constraints.

  ## Citation

+ If you use this model or the SPES framework in your research, please cite:

  ```bibtex
+ @article{zhang2026pretraining,
  title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
+ author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
+ journal={arXiv preprint arXiv:2602.11543},
  year={2026}
+ }
+ ```
+
+ ## Acknowledgements
+
+ The SPES codebase is built upon the modeling and training infrastructure provided by [OLMo (Allen Institute for AI)](https://github.com/allenai/OLMo) and utilizes [MegaBlocks (Databricks)](https://github.com/databricks/megablocks) for efficient MoE operations.
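
Note on the new Description section: it says each node trains only a subset of experts, which reduces the per-node memory footprint. The Python sketch below is a rough illustration of that idea and is not taken from the SPES codebase. The function names, the contiguous assignment rule, the assumption that shared (non-expert) parameters are kept on every node, and all sizes are hypothetical.

```python
from typing import Dict, List


def assign_experts(num_experts: int, num_nodes: int, experts_per_node: int) -> Dict[int, List[int]]:
    """Give each node a contiguous block of expert indices, wrapping around.

    Hypothetical assignment rule, chosen only for illustration.
    """
    return {
        node: [(node * experts_per_node + k) % num_experts for k in range(experts_per_node)]
        for node in range(num_nodes)
    }


def local_param_fraction(
    num_experts: int,
    experts_per_node: int,
    params_per_expert: int,
    shared_params: int,
) -> float:
    """Fraction of all parameters a node handles if it keeps the shared
    parameters plus only its own experts (an assumption, not a SPES detail)."""
    full = shared_params + num_experts * params_per_expert
    local = shared_params + experts_per_node * params_per_expert
    return local / full


if __name__ == "__main__":
    # Toy sizes: 8 experts spread over 4 nodes, 2 experts per node.
    print(assign_experts(8, 4, 2))
    print(local_param_fraction(8, 2, params_per_expert=100, shared_params=200))
```

With these toy numbers each node handles 40% of the parameters. The real ratio depends on the actual expert count and layer sizes, which the model card does not state.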