| | --- |
| | license: llama3.2 |
| | datasets: |
| | - OctoThinker/MegaMath-Web-Pro-Max |
| | - LLM360/MegaMath |
| | language: |
| | - en |
| | base_model: |
| | - meta-llama/Llama-3.2-3B |
| | pipeline_tag: text-generation |
| | --- |
| | |
| | # [OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling](https://arxiv.org/abs/2506.20512) |
| |
|
| |
|
| |
|
| | ## OctoThinker-3B-Hybrid-Zero |
| |
|
| |
|
| | The OctoThinker family starts from the Llama-3 family and applies insights from a careful study of mid-training to produce base language models that are friendly to reinforcement learning. |
| |
|
| | OctoThinker-3B-Hybrid-Zero is trained with R1-Zero-style reinforcement learning, starting directly from OctoThinker-3B-Hybrid-Base without any supervised fine-tuning (SFT). |
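This card does not describe the reward used during RL. R1-Zero-style training is commonly described as relying on rule-based, verifiable rewards rather than a learned reward model, so the sketch below illustrates that general idea for math answers. Everything in it is an assumption made for illustration: the `\boxed{...}` answer convention, the helper names `extract_boxed`, `normalize`, and `rule_based_reward`, and the exact-match comparison. It is not the authors' reward function.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Assumption: the model is asked to put its final answer in \\boxed{...}.
    Braces are matched by depth so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Hypothetical normalization: drop all whitespace before comparing."""
    return "".join(answer.split())


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the boxed answer matches the reference exactly, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

A reward of this kind needs only a reference answer per problem, which is one reason it can be applied to a base model without an intermediate SFT stage. Real pipelines typically use a more tolerant math-equivalence check than the exact string match shown here.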
| |
|
| |
|
| | ### Training Recipe for OctoThinker-3B-Hybrid-Base |
| |
|
| | <div style="display: flex; justify-content: left; gap: 20px;"> |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/XSSllxdLr3dcw250dFm7e.png" alt="Data Pipeline" style="width:90%;"> |
| | </div> |
| |
|
| |
|
| |
|
| |
|
| | ### Evaluation Results of OctoThinker-3B-Base Series |
| |
|
| | Note that we use few-shot prompting to evaluate these base language models. |
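For readers unfamiliar with few-shot evaluation of base models, here is a minimal sketch of how such a prompt can be assembled: a handful of solved exemplars are concatenated ahead of the test question, and the model is left to complete the final answer. The `Question:` / `Answer:` template, the function name `build_few_shot_prompt`, and the exemplars are assumptions for illustration; the card does not state the template or the number of shots used.

```python
from typing import List, Tuple


def build_few_shot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Concatenate solved (question, answer) exemplars, then the open question.

    The "Question:/Answer:" layout is a hypothetical template, not the one
    used in the paper's evaluation.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    # The final block ends at "Answer:" so a base model continues with its answer.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical exemplars, for illustration only.
demo_exemplars = [
    ("What is 2 + 3?", "5"),
    ("What is 7 * 6?", "42"),
]
demo_prompt = build_few_shot_prompt(demo_exemplars, "What is 9 - 4?")
```

Because a base model has no chat template, the exemplars are what communicate the expected answer format, which is why this style of evaluation suits the Base series.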
| |
|
| |
|
| | <div style="display: flex; justify-content: left; gap: 20px;"> |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/UCZ9MahRYqLY0iKjiWMqS.png" alt="Evaluation Results" style="width:80%;"> |
| |
|
| | </div> |
| |
|
| |
|
| |
|
| | ### RL Training Dynamics of OctoThinker-3B-Zero Series |
| |
|
| | <div style="display: flex; justify-content: left; gap: 20px;"> |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/e21Eg8jj_ITxC4YcIJUmx.png" alt="RL Training Dynamics" style="width:80%;"> |
| | </div> |
| |
|
| |
|
| |
|
| | ### More about OctoThinker |
| |
|
| |
|
| | <div style="display: flex; justify-content: left; gap: 20px;"> |
| | <img src="https://cdn-uploads.huggingface.co/production/uploads/62cbeb2d72dfd24b86bdf977/bn85CEB_DW6azJ7KJp11Q.png" alt="More about OctoThinker" style="width:100%;"> |
| | </div> |
| |
|
| |
|
| | ## Citation |
| |
|
| | Check out our [paper](https://arxiv.org/abs/2506.20512) for more details. If you use our models or datasets, or find our work useful, please cite: |
| |
|
| | ``` |
| | @article{wang2025octothinker, |
| | title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling}, |
| | author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei}, |
| | year={2025}, |
| | journal={arXiv preprint arXiv:2506.20512}, |
| | note={Preprint} |
| | } |
| | ``` |
| |
|