reign12 committed on
Commit cd6d2aa · verified · 1 Parent(s): b62406c

Upload README.md with huggingface_hub

Files changed (1): README.md +107 -44
README.md CHANGED
@@ -1,6 +1,3 @@
- ---
- license: mit
- ---
  <div align="center">

  # Open Reasoner Zero
@@ -8,15 +5,13 @@ license: mit
  <img src="figure/logo.jpg" width="300"/>

  <div>
- <!-- I want to use a tide emoji here -->

  An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  </div>
  </div>

  <div align="center" style="line-height: 1;">
- <a href="https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero" style="margin: 2px;">
- <img alt="Code" src="https://img.shields.io/badge/Open%20Reasoner%20Zero-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>

  <a href="https://huggingface.co/Open-Reasoner-Zero" target="_blank"><img alt="Hugging Face"
  src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/></a>
@@ -34,24 +29,45 @@ An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

  </div>

- ![](figure/teaser.png)
-
- *Figure 1 | Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\}. We report the average accuracy on the benchmark dataset for each question with 16 responses. Notably, Open-Reasoner-Zero-32B outperforms DeepSeek-R1-Zero-Qwen-32B on the GPQA Diamond benchmark while only requiring 1/30 of the training steps. We are continuing to scale up these RL settings until this preprint is released, as there is no sign of saturation.*
-
- ![](figure/train_curve.png)
- *Figure 2 | Train Time Scale up both on Reward and Response Length of Open-Reasoner-Zero-{7B, 32B}.*
-
- ## Overview
- 🌊 We introduce **Open-Reasoner-Zero**, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.

  To enable broader participation in this pivotal moment we are witnessing, and to accelerate research towards artificial general intelligence (AGI),
  we release our source code, parameter settings, training data, and model weights.
- Please refer to our [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) for more insights.

  **Let the Reasoner-Zero tide rise!**

  ## Releases 📦

  <strong>[2025/02/18]</strong>
  We release `Open-Reasoner-Zero`.
@@ -67,6 +83,16 @@
  - Colocate training and generation in the same GPUs to maximize GPU utilization.

  ## Getting Started 🚀
  ### Installation & Training Scripts
  We release our [Dockerfile](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/docker/Dockerfile) in the [docker](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/docker) folder to facilitate the reproducibility of our training.
@@ -75,63 +101,94 @@ To install the package, run:
  pip install -e .
  ```

- #### Start Orz-7B PPO Training
- debug running command in single node:
- ```bash
- DEBUG_MODE=True python -m playground.orz_7b_ppo
- ```

- Multi-node Training:
-
- first on master node, run:
  ```bash
  ray start --head
  ```

- then on other nodes, run:
  ```bash
- ray start --address='<master-node-ip>:<master-node-port>'
  ```

- then on master node, run:
  ```bash
- python -m playground.orz_7b_ppo
  ```
-
  Your training log will be shown in the master node terminal.

- #### Start Orz-32B PPO Training
- running command in 8 nodes:

- first on master node, run:
  ```bash
- ray start --head
  ```

- then on other nodes, run:
  ```bash
- ray start --address='<master-node-ip>:<master-node-port>'
  ```

- then on master node, run:
  ```bash
- python -m playground.orz_32b_ppo
  ```

  Your training log will be shown in the master node terminal.

- ### Data

- We release all of 57k curated high-quality training data in the [`data`](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data) folder.

- The details for how to collect data are described in our [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf).

- ## Acknowledgements

  - This work was supported by computing resources and valuable feedback provided by [StepFun](https://www.stepfun.com/) and Tsinghua University.
  - Our training framework is built on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [vllm](https://github.com/vllm-project/vllm), [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) and [ray](https://github.com/ray-project/ray).
- - Our model is based on [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) and [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B).
- - We thank [Project Numina](https://projectnumina.ai/) and [Tulu3](https://allenai.org/blog/tulu-3-technical) for their collected open sourced data.

  ## Advertisement Time 📣
@@ -140,6 +197,12 @@ We are hiring talented researchers and engineers to join our team. If you are in

  [![Star History Chart](https://api.star-history.com/svg?repos=Open-Reasoner-Zero/Open-Reasoner-Zero&type=Timeline)](https://star-history.com/#Open-Reasoner-Zero/Open-Reasoner-Zero&Timeline)

  ## Citation

  ```bibtex
@@ -149,4 +212,4 @@ We are hiring talented researchers and engineers to join our team. If you are in
  year={2025},
  howpublished={\url{https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero}},
  }
- ```
  <div align="center">

  # Open Reasoner Zero

  <img src="figure/logo.jpg" width="300"/>

  <div>

  An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  </div>
  </div>

  <div align="center" style="line-height: 1;">
+ <a href="https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero" style="margin: 2px;"><img alt="Code" src="https://img.shields.io/badge/Open%20Reasoner%20Zero-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/></a>

  <a href="https://huggingface.co/Open-Reasoner-Zero" target="_blank"><img alt="Hugging Face"
  src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/></a>

  </div>

+ ## Overview 🌊
+ We introduce **Open-Reasoner-Zero**, the first open-source implementation of large-scale reasoning-oriented RL training, focusing on scalability, simplicity and accessibility.

  To enable broader participation in this pivotal moment we are witnessing, and to accelerate research towards artificial general intelligence (AGI),
  we release our source code, parameter settings, training data, and model weights.
+ Please refer to our [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) for more insights across various model sizes.

  **Let the Reasoner-Zero tide rise!**

+
+ ## Main Results 🏆
+
+ ![](figure/teaser.png)
+
+ *Figure 1 | Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\} on benchmarks (averaged over 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks, requiring only a tenth of the training steps.*
+
+ ![](figure/train_curve.png)
+ *Figure 2 | Train-time scale-up of Train Reward and Response Length for Open-Reasoner-Zero (ORZ) \{0.5B, 1.5B, 7B, 32B\}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.*
+
  ## Releases 📦

+ <strong>[2025/03/31]</strong>
+ We announce a major milestone for `Open-Reasoner-Zero`:
+
+ - 🌊 [Updated Paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf) with new results.
+ - 🔭 [Easy-to-use Training Scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/playground):
+   - [ORZ-1.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_1p5b_ppo.py) and [ORZ-0.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo.py) (main results in Figure 2).
+   - [Minimal-resource training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo_1gpu.py): ORZ-0.5B can be run on a single A800/H800 GPU!
+ - 🤩 [Updated Curated Datasets](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data):
+   - 129k data in total:
+     - [original 57k data](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_57k_collected.json).
+     - [extended 72k data](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_72k_collection_extended.json).
+   - [13k hard data](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_13k_collection_hard.json) mined from the above 129k data.
+     - used in the "annealing" stage of ORZ-32B training: **AIME2024 from ~41% to ~48%**!
+ - 🤗 More HF Models:
+   - Updated HF Models: [`Open-Reasoner-Zero-7B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B) and [`Open-Reasoner-Zero-32B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-32B).
+   - Released HF Models: [`Open-Reasoner-Zero-1.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-1.5B) and [`Open-Reasoner-Zero-0.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-0.5B).
+ - 🚀 Full Suite of Critic Models for in-depth research: `Open-Reasoner-Zero-Critic-`{[0.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-0.5B), [1.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-1.5B), [7B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-7B), [32B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-32B)}.
+
  <strong>[2025/02/18]</strong>
  We release `Open-Reasoner-Zero`.

  - Colocate training and generation in the same GPUs to maximize GPU utilization.

  ## Getting Started 🚀
+ ### Data
+
+ We release all of our curated high-quality training data in the [`data`](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data) folder:
+ * curated 129k data:
+   * [original 57k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_57k_collected.json), collected from various sources, including AIME (up to 2023), MATH, the Numina-Math collection and Tulu3 MATH.
+   * [extended 72k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_72k_collection_extended.json), mainly cleaned from OpenR1-Math-220k.
+   * [hard 13k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_13k_collection_hard.json), mined from the first stage of ORZ-32B training.
+
+ The details of how we collect the data are described in our [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf).
+
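A quick way to sanity-check the downloaded data is to peek at one record. This is an illustrative sketch only: it assumes the file is a JSON array and is run from the repository root; inspect the printed record to learn the actual schema.

```bash
# Illustrative only: peek at the 57k collection from the repo root.
python -c "
import json
records = json.load(open('data/orz_math_57k_collected.json'))
print(len(records), 'records')
print(records[0])  # inspect one record to see the actual fields
"
```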
  ### Installation & Training Scripts
  We release our [Dockerfile](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/docker/Dockerfile) in the [docker](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/docker) folder to facilitate the reproducibility of our training.

  To install the package, run:
  ```bash
  pip install -e .
  ```
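For a containerized setup, a typical build-and-run sketch looks like the following; the image tag `orz-train` and the bind mount are illustrative assumptions, not fixed by the repository.

```bash
# Build the training image from the released Dockerfile (tag is arbitrary).
docker build -f docker/Dockerfile -t orz-train .

# Open a shell with all GPUs visible; mounting the repo is an illustrative choice.
docker run --gpus all -it -v "$PWD":/workspace orz-train bash
```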

+ #### Start ORZ-32B PPO Training
+ Here are the starting commands for 16 nodes.

+ First, on the master node, run:
  ```bash
  ray start --head
+ # you will see logging like:
+ # Next steps
+ # To add another node to this Ray cluster, run
+ # ray start --address='<master-node-ip>:<master-node-port>'
  ```

+ Then, on all other nodes, run:
  ```bash
+ ray start --address='<master-node-ip>:<master-node-port>' # <master-node-ip> and <master-node-port> are from the logs above!
  ```

+ Finally, on the master node, run:
  ```bash
+ python -m playground.orz_32b_ppo
  ```

  Your training log will be shown in the master node terminal.
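Before launching, it is worth confirming that every node has actually joined the cluster; `ray status` is the standard Ray CLI check for this (a suggested step, not part of the original instructions):

```bash
# On the master node: node and GPU totals should match the 16-node setup.
ray status
# If nodes are missing, re-run `ray start --address=...` on them first.
```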

+ ------

+ #### Start ORZ-0.5B PPO Training
+ You can start the ORZ-0.5B PPO training on a single A800/H800 node:
  ```bash
+ python -m playground.orz_0p5b_ppo
  ```

+ You can even run it on **a single A800/H800 GPU**:
  ```bash
+ python -m playground.orz_0p5b_ppo_1gpu
  ```
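If the machine hosts multiple GPUs and you want this single-GPU run pinned to a specific device, the standard CUDA selector applies (a usage note, not from the original scripts):

```bash
# Pin the run to GPU 0; any free device index works.
CUDA_VISIBLE_DEVICES=0 python -m playground.orz_0p5b_ppo_1gpu
```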

+ Note: since this is not a multi-node setting, no `ray start`-like logic is needed.
+
+ ------
+
+ #### Start ORZ-7B PPO Training
+
+ Multi-node training on 4 nodes:
  ```bash
+ # set up for multi-node training
+ ray start --head # on master node
+ ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes
+
+ # then on master node, run:
+ python -m playground.orz_7b_ppo
  ```

  Your training log will be shown in the master node terminal.

+ -----

+ #### Start ORZ-1.5B PPO Training
+
+ Multi-node training on 2 nodes:
+ ```bash
+ # set up for multi-node training
+ ray start --head # on master node
+ ray start --address='<master-node-ip>:<master-node-port>' # then on other nodes
+ # then on master node, run:
+ python -m playground.orz_1p5b_ppo
+ ```
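Once a multi-node run finishes, the Ray cluster can be torn down with the standard CLI (a suggested cleanup step, not part of the original instructions):

```bash
# Stop the local Ray processes; repeat on master and worker nodes.
ray stop
```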

+ ----
+
+ #### Debug Settings
+ In the code, we provide an environment variable `DEBUG_MODE` so researchers can iterate in a debug setting. (Though for now, we recommend using `python -m playground.orz_0p5b_ppo_1gpu` for debugging.)
+
+ Example debug commands:
+ ```bash
+ # NOTE: just for debugging, not the final setting!
+
+ ## Debug command on a single GPU with `EleutherAI/pythia-14m`
+ DEBUG_MODE=True python -m playground.orz_14m_ppo_mini
+ ## Debug command on a single node (8 GPUs) with `Qwen/Qwen2.5-7B`
+ DEBUG_MODE=True python -m playground.orz_7b_ppo
+ ```
+
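When iterating in debug mode, it can help to keep the console output for comparison across runs; plain shell redirection is enough (a suggestion, not part of the original setup):

```bash
# Save the debug run's output while still streaming it to the terminal.
DEBUG_MODE=True python -m playground.orz_14m_ppo_mini 2>&1 | tee debug_run.log
```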
+ ## Acknowledgements 💖

  - This work was supported by computing resources and valuable feedback provided by [StepFun](https://www.stepfun.com/) and Tsinghua University.
  - Our training framework is built on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [vllm](https://github.com/vllm-project/vllm), [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) and [ray](https://github.com/ray-project/ray).
+ - Our model is based on the [Qwen2.5 Series](https://qwenlm.github.io/blog/qwen2.5-llm/) of **base models**, including [Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B), [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B), [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) and [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B).
+ - We thank [Project Numina](https://projectnumina.ai/), [Tulu3](https://allenai.org/blog/tulu-3-technical) and [OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k) for their open-sourced data collections.

  ## Advertisement Time 📣

  [![Star History Chart](https://api.star-history.com/svg?repos=Open-Reasoner-Zero/Open-Reasoner-Zero&type=Timeline)](https://star-history.com/#Open-Reasoner-Zero/Open-Reasoner-Zero&Timeline)

+ ## Community Discussions 🍺
+
+ We have several WeChat groups to help with discussion and sharing; you can scan the QR code below to join the latest group.
+
+ <img src="figure/WeChatGroup.png" width="300" style="display: block; margin: 0 auto;"/>
+
  ## Citation

  ```bibtex

  year={2025},
  howpublished={\url{https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero}},
  }
+ ```