Auraithm committed
Commit 72b5d8d · verified
1 Parent(s): 72f2668

Upload README.md with huggingface_hub

Files changed (1)
README.md +224 -70
README.md CHANGED
@@ -1,98 +1,252 @@
  ---
- language:
- - en
- - zh
- license: apache-2.0
- tags:
- - math
- - reasoning
- - diffusion
- base_model: JetLM/SDAR-8B-Chat
- model_type: sdar
- ---

- <h1 align="center">DiRL-8B-Instruct</h1>

  <p align="center">
- <a href="https://github.com/OpenMOSS/DiRL">💻 Github Repo</a>
  </p>

- ## Introduction

- **DiRL-8B-Instruct** is an 8B-parameter diffusion language model specialized for mathematical reasoning. It is trained with the [DiRL](https://github.com/OpenMOSS/DiRL) framework on top of [SDAR-8B-Chat](https://huggingface.co/JetLM/SDAR-8B-Chat). Through two-stage training (SFT + RL), DiRL-8B-Instruct achieves state-of-the-art results at the 8B scale on mathematical reasoning benchmarks, even outperforming 32B models on most tasks.

- > **Highlights**
- >
- > * **SOTA Performance:** Achieves **83.05%** on MATH500, **20.63%** on AIME2024, and **20.83%** on AIME2025, surpassing all 8B baselines.
- > * **Training Framework:** Trained with [DiRL](https://github.com/OpenMOSS/DiRL), an efficient training framework for diffusion language models.
- > * **Strong Baseline:** Built on [SDAR-8B-Chat](https://huggingface.co/JetLM/SDAR-8B-Chat), gaining **+11.20%** on MATH500 and **+11.46%** on AIME2024.

- ## Inference

- ### Using LMDeploy

  ```python
  from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
  from transformers import AutoTokenizer

- model_path = "OpenMOSS-Team/DiRL-8B-Instruct"

- # Load tokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_path)

- # Prepare prompts
- prompts = [
-     [{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}],
  ]
- prompts = tokenizer.apply_chat_template(prompts, tokenize=False, add_generation_prompt=True)
-
- # Configure backend for DLLM inference
- backend_config = PytorchEngineConfig(
-     dtype="float16",
-     max_prefill_token_num=8192,
-     cache_max_entry_count=0.8,
-     dllm_block_length=4,
-     dllm_denoising_steps=4,
-     dllm_unmasking_strategy="low_confidence_dynamic",
-     dllm_confidence_threshold=0.9,
- )
-
- # Create inference pipeline
- with pipeline(model_path, backend_config=backend_config) as pipe:
-     gen_config = GenerationConfig(
-         top_p=1.0,
-         top_k=50,
-         temperature=1.0,
-         do_sample=False,  # greedy decoding
-         max_new_tokens=8192,
-     )
-
-     outputs = pipe(prompts, gen_config=gen_config)
-
-     for output in outputs:
-         print(output.text)
  ```

- ## Performance

- | Model | MATH500 | GSM8K | AIME2024 | AIME2025 | OlympiadBench | Average |
- |-------|---------|-------|----------|----------|---------------|---------|
- | Qwen2.5-7B-Instruct | 73.78 | 89.78 | 8.96 | 5.63 | 36.58 | 42.95 |
- | Qwen2.5-32B-Instruct | 81.13 | **94.03** | 12.92 | 11.88 | 45.65 | 49.12 |
- | SDAR-8B-Chat | 71.85 | 89.87 | 9.17 | 9.38 | 36.03 | 43.26 |
- | Trado-8B-Instruct | 75.59 | 91.06 | 11.67 | 15.00 | 40.32 | 46.73 |
- | **DiRL-8B-Instruct** | **83.05** | 93.03 | **20.63** | **20.83** | **46.40** | **52.79** |

- ## Citation

- If you use this model in your research, please cite:

  ```bibtex
  @misc{zhu2025dirl,
- title={DiRL: An Efficient Training Framework for Diffusion Language Models},
- author={Zhu, Ying and Wan, Jiaxin and Liang, Tianyi and Guo, Xu and Liu, Xiaoran and Huang, Zengfeng and He, Ziwei and Qiu, Xipeng},
  year={2025},
- institution={Fudan University, Shanghai Innovation Institute},
- url={https://github.com/OpenMOSS/DiRL}
  }
- ```
+ <div align="center">
+
+ <p align="center">
+ <img src="static/images/DiRL.jpg" alt="DiRL" width="300">
+ </p>
+
+ <!-- <h1>DiRL</h1> -->
+
+ <h2>An Efficient Post-Training Framework for Diffusion Language Models</h2>
+
+ <p>
+ <b>Ying Zhu</b><sup>1,2,3</sup>, <b>Jiaxin Wan</b><sup>2</sup>, <b>Xiaoran Liu</b><sup>1,2,3</sup>, <b>Siyanag He</b><sup>1,2,3</sup>, <b>Qiqi Wang</b><sup>1,2,3</sup>,<br>
+ <b>Xu Guo</b><sup>1,2</sup>, <b>Tianyi Liang</b><sup>2,3</sup>, <b>Zengfeng Huang</b><sup>1,2</sup>, <b>Ziwei He</b><sup>2,3,†</sup>, <b>Xipeng Qiu</b><sup>1,2,†</sup>
+ </p>
+
+ <p>
+ <sup>1</sup>Fudan University &nbsp;&nbsp; <sup>2</sup>Shanghai Innovation Institute &nbsp;&nbsp; <sup>3</sup>OpenMoss Team
+ </p>
+
+ <p>
+ <sup>†</sup>Corresponding authors
+ </p>
+
+ </div>
+
+ <p align="center">
+ <a href="https://arxiv.org/abs/2512.22234">
+ <img src="https://img.shields.io/badge/arXiv-2512.22234-b31b1b.svg" alt="Paper on arXiv"/>
+ </a>
+ <a href="https://github.com/OpenMOSS/DiRL">
+ <img src="https://img.shields.io/badge/GitHub-Code-black.svg?logo=github" alt="GitHub Code"/>
+ </a>
+ <a href="https://huggingface.co/OpenMOSS-Team/DiRL-8B-Instruct">
+ <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow.svg" alt="Hugging Face Model"/>
+ </a>
+ <a href="https://huggingface.co/collections/Auraithm/dirl">
+ <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Data-yellow.svg" alt="Hugging Face Data"/>
+ </a>
+ <a href="LICENSE">
+ <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License"/>
+ </a>
+ </p>
+
+ <p align="center">
+ <img src="static/images/accuracy.png" alt="Overview" width="750">
+ </p>
+
  ---

+ ## 🌟 TL;DR
+
+ We introduce **DiRL**, an open-source training framework for Diffusion Language Models (DLLMs) covering both SFT and RL stages. Using this framework, we train **DiRL-8B-Instruct**, which achieves state-of-the-art results at the 8B scale on mathematical reasoning benchmarks, even outperforming 32B models on most tasks.
+
+ ## 🌱 Highlights
+
+ - **🎯 Novel RL Algorithm:** We propose **DiPO (Discrete Diffusion Policy Optimization)**, an RL algorithm that optimizes DLLMs at the level of individual generation steps. Its implementation is unbiased, keeping the optimization objective fully consistent with the actual training process, and it integrates dynamic sampling from DAPO during rollout to filter out low-quality data.
+
+ - **🚀 Efficient Training & Inference:** We support the **Accelerate** framework for distributed training and the **LMDeploy** inference engine for efficient rollout, and integrate a **Speed Reward** mechanism that optimizes inference speed at the training level, enabling both faster training and faster generation without sacrificing quality (a toy sketch follows this list).
+
+ - **🧠 SOTA Performance:** We achieve state-of-the-art results at the 8B scale among both autoregressive (AR) models and diffusion language models (DLLMs) across multiple mathematical reasoning benchmarks. Specifically, we reach **83.05%** on MATH500, **20.63%** on AIME2024, and **20.83%** on AIME2025, surpassing all 8B baselines and even outperforming the 32B Qwen2.5-32B-Instruct model on the AIME benchmarks.
+
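+ For intuition, the snippet below sketches one way a speed-aware reward could combine answer correctness with a penalty on generation cost. It is a simplified illustration with an assumed weighting and cost measure, not the exact reward used in DiRL.
+
+ ```python
+ # Toy speed-aware reward: correctness bonus minus a penalty that grows
+ # with the number of denoising steps. Weighting and cost measure are assumptions.
+ def speed_aware_reward(is_correct: bool, num_denoising_steps: int,
+                        max_steps: int = 2048, speed_weight: float = 0.1) -> float:
+     correctness = 1.0 if is_correct else 0.0
+     speed_penalty = speed_weight * (num_denoising_steps / max_steps)
+     return correctness - speed_penalty
+ ```
+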
+ ## 📰 News
+
+ - **[2025.12]** 🚀 Major framework update! We now support **Flex-Attention** for faster training, an **LMDeploy API server** with **real-time policy updates** to enable **online RL**, and the **DAPO** algorithm. We also release the [technical report](https://arxiv.org/abs/2512.22234) and [training datasets](https://huggingface.co/collections/Auraithm/dirl).
+
+ - **[2025.11]** 🎉 We release **DiRL**, an open-source post-training framework for Diffusion Language Models! Using this framework, we train **DiRL-8B-Instruct**, which achieves **state-of-the-art** results among 8B models. Released [code](https://github.com/OpenMOSS/DiRL) and [model](https://huggingface.co/OpenMOSS-Team/DiRL-8B-Instruct).
+
+ ## 🧠 Method
+
+ We develop and release an open-source post-training framework for DLLMs and use it to train **DiRL-8B-Instruct** based on **SDAR-8B-Chat** through two stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, we adopt a random-masking strategy to construct the training data for fine-tuning. In the RL stage, we design an RL algorithm, **DiPO (Discrete Diffusion Policy Optimization)**, which optimizes at the generation-step level; its implementation is unbiased, ensuring complete consistency between the optimization objective and the actual training process. Additionally, during the rollout phase, we adopt dynamic sampling from DAPO to filter out data with zero advantage standard deviation. Through this two-stage pipeline, we obtain **DiRL-8B-Instruct**, a high-performance diffusion language model for mathematical reasoning.
+
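+ The snippet below is a simplified sketch of the step-level objective and the dynamic-sampling filter described above; the tensor names and shapes are illustrative assumptions, not the actual DiPO implementation or the framework's interfaces.
+
+ ```python
+ import torch
+
+ def dynamic_sampling_filter(rewards: torch.Tensor) -> torch.Tensor:
+     """DAPO-style dynamic sampling: keep prompt groups whose rollout
+     rewards are not all identical (non-zero advantage standard deviation)."""
+     # rewards: [num_prompts, group_size]
+     return rewards.std(dim=-1) > 0
+
+ def step_level_pg_loss(logp_new, logp_old, rewards, clip_eps=0.2):
+     """Clipped policy-gradient loss accumulated over generation steps.
+
+     logp_new / logp_old: [num_prompts, group_size, num_steps], log-probs of
+         the tokens unmasked at each denoising step under the current policy
+         and the rollout policy, respectively.
+     rewards: [num_prompts, group_size], scalar reward per rollout.
+     """
+     keep = dynamic_sampling_filter(rewards)
+     logp_new, logp_old, rewards = logp_new[keep], logp_old[keep], rewards[keep]
+
+     # Group-normalized advantages, broadcast over generation steps.
+     adv = (rewards - rewards.mean(-1, keepdim=True)) / (rewards.std(-1, keepdim=True) + 1e-6)
+     adv = adv.unsqueeze(-1)
+
+     ratio = torch.exp(logp_new - logp_old)  # per-step importance ratio
+     unclipped = ratio * adv
+     clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
+     return -torch.min(unclipped, clipped).mean()
+ ```
+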
+ ## 📊 Performance
+
+ **DiRL-8B-Instruct** achieves state-of-the-art results among DLLMs across mathematical reasoning benchmarks. Highlights include **83.05%** on MATH500 (surpassing the base model by **+11.20%**), **20.63%** on AIME2024 and **20.83%** on AIME2025 (dramatically outperforming all baselines), and **46.40%** on OlympiadBench. Our 8B model achieves performance comparable to or exceeding much larger 32B models on most benchmarks.

  <p align="center">
+ <img src="static/images/performance.jpg" alt="Performance Comparison" width="750">
  </p>

+ ## 🚀 Quick Start
+
+ ### Installation
+
+ ```bash
+ git clone https://github.com/OpenMOSS/DiRL.git
+ cd DiRL
+ pip install -r requirements.txt
+ ```
+
+ If `flash-attn` installation fails, you can download the pre-built wheel file and install it manually:
+
+ ```bash
+ wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+ pip install flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
+ ```

+ ### Download Models and Datasets

+ Edit `download.sh` to set your Hugging Face token and username, then run:

+ ```bash
+ bash download.sh
+ ```
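+
+ Alternatively, you can fetch the model and the lightweight datasets directly with `huggingface_hub`; the repo IDs below are the ones linked in this README, and the `local_dir` paths are only examples:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Model weights (pass token=... if the repo requires authentication)
+ snapshot_download(repo_id="OpenMOSS-Team/DiRL-8B-Instruct", local_dir="models/DiRL-8B-Instruct")
+
+ # Lightweight SFT / RL datasets
+ snapshot_download(repo_id="Auraithm/Light-OpenR1Math-SFT", repo_type="dataset", local_dir="data/Light-OpenR1Math-SFT")
+ snapshot_download(repo_id="Auraithm/Light-MATH-RL", repo_type="dataset", local_dir="data/Light-MATH-RL")
+ ```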

+ ### Inference

  ```python
  from lmdeploy import pipeline, PytorchEngineConfig, GenerationConfig
  from transformers import AutoTokenizer

+ if __name__ == '__main__':
+     model_path = "OpenMOSS-Team/DiRL-8B-Instruct"
+
+     # Load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+     # Prepare prompts
+     prompts = [
+         [{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}],
+     ]
+     prompts = tokenizer.apply_chat_template(prompts, tokenize=False, add_generation_prompt=True)
+
+     # Configure backend for DLLM inference
+     backend_config = PytorchEngineConfig(
+         dtype="float16",
+         max_prefill_token_num=8192,
+         cache_max_entry_count=0.8,
+         dllm_block_length=4,
+         dllm_denoising_steps=4,
+         dllm_unmasking_strategy="low_confidence_dynamic",
+         dllm_confidence_threshold=0.9,
+     )
+
+     # Create inference pipeline
+     with pipeline(model_path, backend_config=backend_config) as pipe:
+         gen_config = GenerationConfig(
+             top_p=1.0,
+             top_k=50,
+             temperature=1.0,
+             do_sample=False,  # greedy decoding
+             max_new_tokens=8192,
+         )
+
+         outputs = pipe(prompts, gen_config=gen_config)
+
+         for output in outputs:
+             print(output.text)
+ ```
+
+ ### Evaluation
+
+ To evaluate models on multiple benchmarks (MATH500, GSM8K, AIME2024, AIME2025, OlympiadBench):
+
+ ```bash
+ bash examples/eval.sh
+ ```
+
+ ### Training
+
+ **Step 1: Prepare Training Data**
+
+ While the full DiRL-8B-Instruct training data is not yet released, we provide lightweight datasets for quick experimentation:
+ - [Light-OpenR1Math-SFT](https://huggingface.co/datasets/Auraithm/Light-OpenR1Math-SFT): 2K SFT samples from OpenR1Math
+ - [Light-MATH-RL](https://huggingface.co/datasets/Auraithm/Light-MATH-RL): 4K RL samples from MATH

+ > **Tip:** For initial experimentation, we recommend starting with **max_new_tokens** set to 2K to reduce training time and resource requirements.

+ You can also create your own training datasets following the formats below:
+
+ SFT training data format:
+ ```json
+ [
+   {
+     "prompt": "<|im_start|>user\n[question]<|im_end|>\n<|im_start|>assistant\n",
+     "response": "[answer]<|im_end|><|endoftext|>"
+   }
  ]
  ```

+ RL training data format:
+ ```json
+ [
+   {
+     "question": "[question]",
+     "ground_truth_answer": "[answer]"
+   }
+ ]
+ ```
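+
+ For example, raw question/answer pairs can be converted into both formats with a few lines of Python; the chat markers are copied verbatim from the SFT template above, and the output file names are arbitrary:
+
+ ```python
+ import json
+
+ qa_pairs = [("If x + 5 = 12, what is x?", "x = 7")]
+
+ # Build SFT records with the chat markers shown above.
+ sft_records = [
+     {
+         "prompt": f"<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n",
+         "response": f"{a}<|im_end|><|endoftext|>",
+     }
+     for q, a in qa_pairs
+ ]
+ # Build RL records with question / ground-truth answer pairs.
+ rl_records = [{"question": q, "ground_truth_answer": a} for q, a in qa_pairs]
+
+ with open("sft_data.json", "w") as f:
+     json.dump(sft_records, f, ensure_ascii=False, indent=2)
+ with open("rl_data.json", "w") as f:
+     json.dump(rl_records, f, ensure_ascii=False, indent=2)
+ ```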

+ **Step 2: Two-Stage Training**

+ **Stage 1: SFT Training**

+ Supervised fine-tuning with a random-masking strategy adapts the base model to mathematical reasoning tasks.
+
+ ```bash
+ bash examples/sft.sh
+ ```
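+
+ For intuition, the snippet below sketches the idea behind random-masking SFT for a diffusion LM: a random fraction of response tokens is replaced by a mask token, and the loss is computed only at those positions. It is a conceptual illustration; the mask token id, ratio schedule, and batching follow the framework's own implementation rather than this sketch.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def random_masking_sft_loss(model, input_ids, response_mask, mask_token_id):
+     """Conceptual random-masking SFT step (all tensors: [batch, seq_len])."""
+     # Sample a masking ratio per example and mask that fraction of
+     # response tokens; prompt tokens are never masked.
+     ratio = torch.rand(input_ids.size(0), 1, device=input_ids.device)
+     noise = torch.rand_like(input_ids, dtype=torch.float)
+     masked = (noise < ratio) & response_mask.bool()
+
+     noisy_ids = input_ids.clone()
+     noisy_ids[masked] = mask_token_id
+
+     # Assumes a Hugging Face-style model that returns .logits.
+     logits = model(input_ids=noisy_ids).logits  # [batch, seq_len, vocab]
+     return F.cross_entropy(
+         logits[masked],       # predictions at masked positions
+         input_ids[masked],    # original tokens as targets
+     )
+ ```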
+
+ **Stage 2: RL Training**
+
+ Reinforcement learning with the DiPO algorithm optimizes the model at the generation-step level.
+
+ ```bash
+ bash examples/rl.sh
+ ```
+
+ ## 📋 Roadmap
+
+ - [x] Release Inference Engine and Training Framework
+ - [x] Release DiRL Technical Report
+ - [ ] Release Training Data of DiRL-8B-Instruct
+ - [ ] Release Thinking Model
+ - [ ] Support More RL Algorithms
+ - [ ] More Features in Progress
+
+ ## 👏 Acknowledgement
+
+ We would like to express our gratitude to the following works ([LLaDA](https://github.com/ML-GSAI/LLaDA), [SDAR](https://github.com/JetAstra/SDAR), [dllm-RL](https://github.com/Gen-Verse/dLLM-RL), [lmdeploy](https://github.com/InternLM/lmdeploy), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [Flex-Attention](https://pytorch.org/blog/flexattention/)) for providing important theoretical foundations and inspiration for DiRL.
+
+ ## 💬 Community
+
+ Join our WeChat group to discuss DLLM training and related topics:
+
+ <p align="center">
+ <img src="static/images/qr_code.jpg" alt="WeChat QR Code" width="400">
+ </p>
+
+ ## 📧 Contact
+
+ For issues or inquiries:
+
+ - **Ying Zhu**, Shanghai Innovation Institute ([auraithm@gmail.com](mailto:auraithm@gmail.com))
+
+ ## 📖 Citation
+
+ If you find our work helpful, please consider citing:

  ```bibtex
  @misc{zhu2025dirl,
+ title={DiRL: An Efficient Post-Training Framework for Diffusion Language Models},
+ author={Zhu, Ying and Wan, Jiaxin and Liu, Xiaoran and He, Siyanag and Wang, Qiqi and Guo, Xu and Liang, Tianyi and Huang, Zengfeng and He, Ziwei and Qiu, Xipeng},
  year={2025},
+ eprint={2512.22234},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2512.22234}
  }
+ ```