Improve model card with metadata and links
Browse filesThis PR improves the model card by:
- Adding the `pipeline_tag: reinforcement-learning` to the metadata, improving discoverability on the Hugging Face Hub.
- Specifying the `library_name: transformers` in the metadata.
- Correcting the paper link to point to the arXiv page.
This ensures the model is correctly categorized and easily discoverable by researchers interested in reinforcement learning models built using the Transformers library.
README.md
CHANGED
|
@@ -1,6 +1,9 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
|
|
|
|
|
|
| 3 |
---
|
|
|
|
| 4 |
<div align="center">
|
| 5 |
|
| 6 |
# Open Reasoner Zero
|
|
@@ -24,7 +27,7 @@ An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
|
|
| 24 |
src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white"/></a>
|
| 25 |
|
| 26 |
<br>
|
| 27 |
-
<a href="https://
|
| 28 |
</div>
|
| 29 |
|
| 30 |
<div>
|
|
@@ -34,10 +37,11 @@ An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
|
|
| 34 |
|
| 35 |
## Overview π
|
| 36 |
We introduce **Open-Reasoner-Zero**, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.
|
|
|
|
| 37 |
|
| 38 |
To enable broader participation in this pivotal moment we witnessed and accelerate research towards artificial general intelligence (AGI),
|
| 39 |
we release our source code, parameter settings, training data, and model weights.
|
| 40 |
-
Please refer to our [paper](https://
|
| 41 |
|
| 42 |
**Let the Reasoner-Zero tide rise!**
|
| 43 |
|
|
@@ -46,17 +50,17 @@ Please refer to our [paper](https://github.com/Open-Reasoner-Zero/Open-Reasoner-
|
|
| 46 |
|
| 47 |

|
| 48 |
|
| 49 |
-
*Figure 1 | Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\}. Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\} on benchmarks (averaged on 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on AIME2024, MATH500, and GPQA Diamond benchmark-requiring only a tenth of the training steps.*
|
| 50 |
|
| 51 |

|
| 52 |
*Figure 2 | Train-time Scale up on Train Reward and Response Length of Open-Reasoner-Zero (ORZ) - \{0.5B, 1.5B, 7B, 32B\}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.*
|
| 53 |
|
| 54 |
## Releases π¦
|
| 55 |
|
| 56 |
-
|
| 57 |
We announce a major milestone for `Open-Reasoner-Zero`:
|
| 58 |
|
| 59 |
-
- π [Updated Paper](https://
|
| 60 |
- π [Easy-to-use Training Scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/playground):
|
| 61 |
- [ORZ-1.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_1p5b_ppo.py) and [ORZ-0.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo.py) (main results in Figure 2).
|
| 62 |
- [Minimal resource training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo_1gpu.py): ORZ-0.5B can be run on a single A800/H800 gpu!
|
|
@@ -71,11 +75,11 @@ We announce a major milestone for `Open-Reasoner-Zero`:
|
|
| 71 |
- Released HF Models: [`Open-Reasoner-Zero-1.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-1.5B) and [`Open-Reasoner-Zero-0.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-0.5B).
|
| 72 |
- π Full Suite of Critic Models for in-depth research: `Open-Reasoner-Zero-Critic-`{[0.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-0.5B), [1.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-1.5B), [7B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-7B), [32B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-32B)}.
|
| 73 |
|
| 74 |
-
|
| 75 |
We release `Open-Reasoner-Zero`.
|
| 76 |
|
| 77 |
As part of this release, we open-source:
|
| 78 |
-
- π [Paper](https://
|
| 79 |
- π€ HF Model [`Open-Reasoner-Zero-7B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B) and [`Open-Reasoner-Zero-32B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-32B)
|
| 80 |
- π [`Our curated 57k training data`](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data)
|
| 81 |
- π [Training Scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/playground) to enjoy your own Reasoner-Zero journey!
|
|
@@ -94,7 +98,7 @@ We release all of curated high-quality training data in the [`data`](https://git
|
|
| 94 |
* [extended 72k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_72k_collection_extended.json), mainly cleaned from OpenR1-Math-220k.
|
| 95 |
* [hard 13k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_13k_collection_hard.json), mined from the first stage of ORZ-32B training.
|
| 96 |
|
| 97 |
-
The details for how to collect data are described in our [paper](https://
|
| 98 |
|
| 99 |
### Installation & Training Scripts
|
| 100 |
We release our [Dockerfile](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/docker/Dockerfile) in [docker](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/docker) folder to facilitate the reproducibility of our training.
|
|
@@ -186,6 +190,14 @@ DEBUG_MODE=True python -m playground.orz_14m_ppo_mini
|
|
| 186 |
DEBUG_MODE=True python -m playground.orz_7b_ppo
|
| 187 |
```
|
| 188 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 189 |
## Acknowledgements π
|
| 190 |
|
| 191 |
- This work was supported by computing resources and valuable feedback provided by [StepFun](https://www.stepfun.com/) and Tsinghua University.
|
|
@@ -209,11 +221,13 @@ We have several wechat groups to help discussions and sharing, you can scan the
|
|
| 209 |
## Citation
|
| 210 |
|
| 211 |
```bibtex
|
| 212 |
-
@misc{
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
|
|
|
|
|
|
|
|
|
| 217 |
}
|
| 218 |
-
```
|
| 219 |
-
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
+
library_name: transformers
|
| 4 |
+
pipeline_tag: reinforcement-learning
|
| 5 |
---
|
| 6 |
+
|
| 7 |
<div align="center">
|
| 8 |
|
| 9 |
# Open Reasoner Zero
|
|
|
|
| 27 |
src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white"/></a>
|
| 28 |
|
| 29 |
<br>
|
| 30 |
+
<a href="https://arxiv.org/abs/2503.24290"><b>Paper Arxiv Link</b>ποΈ</a>
|
| 31 |
</div>
|
| 32 |
|
| 33 |
<div>
|
|
|
|
| 37 |
|
| 38 |
## Overview π
|
| 39 |
We introduce **Open-Reasoner-Zero**, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.
|
| 40 |
+
Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiencyβrequiring only a tenth of the training steps, compared to DeepSeek-R1-Zero pipeline.
|
| 41 |
|
| 42 |
To enable broader participation in this pivotal moment we witnessed and accelerate research towards artificial general intelligence (AGI),
|
| 43 |
we release our source code, parameter settings, training data, and model weights.
|
| 44 |
+
Please refer to our [paper](https://arxiv.org/abs/2503.24290) for more insights across various model sizes.
|
| 45 |
|
| 46 |
**Let the Reasoner-Zero tide rise!**
|
| 47 |
|
|
|
|
| 50 |
|
| 51 |

|
| 52 |
|
| 53 |
+
*Figure 1 | Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\}. Evaluation performance of Open-Reasoner-Zero-\{7B, 32B\} on benchmarks (averaged on 16 responses) during training. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark-requiring only a tenth of the training steps.*
|
| 54 |
|
| 55 |

|
| 56 |
*Figure 2 | Train-time Scale up on Train Reward and Response Length of Open-Reasoner-Zero (ORZ) - \{0.5B, 1.5B, 7B, 32B\}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.*
|
| 57 |
|
| 58 |
## Releases π¦
|
| 59 |
|
| 60 |
+
**[2025/03/31]**
|
| 61 |
We announce a major milestone for `Open-Reasoner-Zero`:
|
| 62 |
|
| 63 |
+
- π [Updated Paper](https://arxiv.org/abs/2503.24290) with new results.
|
| 64 |
- π [Easy-to-use Training Scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/playground):
|
| 65 |
- [ORZ-1.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_1p5b_ppo.py) and [ORZ-0.5B training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo.py) (main results in Figure 2).
|
| 66 |
- [Minimal resource training scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/playground/orz_0p5b_ppo_1gpu.py): ORZ-0.5B can be run on a single A800/H800 gpu!
|
|
|
|
| 75 |
- Released HF Models: [`Open-Reasoner-Zero-1.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-1.5B) and [`Open-Reasoner-Zero-0.5B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-0.5B).
|
| 76 |
- π Full Suite of Critic Models for in-depth research: `Open-Reasoner-Zero-Critic-`{[0.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-0.5B), [1.5B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-1.5B), [7B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-7B), [32B](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-Critic-32B)}.
|
| 77 |
|
| 78 |
+
**[2025/02/18]**
|
| 79 |
We release `Open-Reasoner-Zero`.
|
| 80 |
|
| 81 |
As part of this release, we open-source:
|
| 82 |
+
- π [Paper](https://arxiv.org/abs/2503.24290) on our comprehensive analysis and insights in Reasoner-Zero training
|
| 83 |
- π€ HF Model [`Open-Reasoner-Zero-7B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-7B) and [`Open-Reasoner-Zero-32B`](https://huggingface.co/Open-Reasoner-Zero/Open-Reasoner-Zero-32B)
|
| 84 |
- π [`Our curated 57k training data`](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/data)
|
| 85 |
- π [Training Scripts](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/playground) to enjoy your own Reasoner-Zero journey!
|
|
|
|
| 98 |
* [extended 72k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_72k_collection_extended.json), mainly cleaned from OpenR1-Math-220k.
|
| 99 |
* [hard 13k](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/data/orz_math_13k_collection_hard.json), mined from the first stage of ORZ-32B training.
|
| 100 |
|
| 101 |
+
The details for how to collect data are described in our [paper](https://arxiv.org/abs/2503.24290).
|
| 102 |
|
| 103 |
### Installation & Training Scripts
|
| 104 |
We release our [Dockerfile](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/docker/Dockerfile) in [docker](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main/docker) folder to facilitate the reproducibility of our training.
|
|
|
|
| 190 |
DEBUG_MODE=True python -m playground.orz_7b_ppo
|
| 191 |
```
|
| 192 |
|
| 193 |
+
### How to Use the Model
|
| 194 |
+
#### Policy Model
|
| 195 |
+
Policy models can be used in the same way as any chat model in transformers and vllm, since we have put the chat template jinja in the tokenizer.
|
| 196 |
+
|
| 197 |
+
#### Critic Model
|
| 198 |
+
Critic models can be loaded the same way like in the [training code](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/orz/ppo/actors.py#L738).
|
| 199 |
+
|
| 200 |
+
|
| 201 |
## Acknowledgements π
|
| 202 |
|
| 203 |
- This work was supported by computing resources and valuable feedback provided by [StepFun](https://www.stepfun.com/) and Tsinghua University.
|
|
|
|
| 221 |
## Citation
|
| 222 |
|
| 223 |
```bibtex
|
| 224 |
+
@misc{hu2025openreasonerzeroopensourceapproach,
|
| 225 |
+
title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model},
|
| 226 |
+
author={Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang and Heung-Yeung Shum},
|
| 227 |
+
year={2025},
|
| 228 |
+
eprint={2503.24290},
|
| 229 |
+
archivePrefix={arXiv},
|
| 230 |
+
primaryClass={cs.LG},
|
| 231 |
+
url={https://arxiv.org/abs/2503.24290},
|
| 232 |
}
|
| 233 |
+
```
|
|
|