---
license: mit
library_name: transformers
---
# ReasonFlux-PRM

[Code](https://github.com/Gen-Verse/ReasonFlux) | [Paper](https://arxiv.org/abs/2506.18896)
We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. It supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.
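The reward-guided test-time scaling use case can be sketched as best-of-N selection over sampled reasoning trajectories. This is a minimal sketch only: the PRM call below is a stub standing in for a real ReasonFlux-PRM forward pass (see the linked repository for actual inference code), and `score_trajectory` with its length-based heuristic is an illustrative assumption, not the model's scoring rule.

```python
# Reward-guided best-of-N selection (sketch).
# The PRM scorer is stubbed out; replace `score_trajectory` with a real
# ReasonFlux-PRM-7B forward pass in practice.

def score_trajectory(steps: list[str]) -> float:
    """Stub PRM: assigns each step a reward, then averages into a
    trajectory-level score. Purely illustrative."""
    step_rewards = [min(len(s) / 50.0, 1.0) for s in steps]  # per-step reward
    return sum(step_rewards) / len(step_rewards)             # trajectory score

def best_of_n(candidates: list[list[str]]) -> list[str]:
    """Keep the candidate trajectory with the highest PRM score."""
    return max(candidates, key=score_trajectory)

# Two candidate chains of thought for the same prompt:
candidates = [
    ["Let x = 2.", "Then x + 3 = 5."],
    ["We set x = 2, so substituting into x + 3 gives 5.",
     "Therefore the answer is 5."],
]
best = best_of_n(candidates)
```

The same loop generalizes to any N: sample N trajectories from the policy model, score each with the PRM, and return only the top-scoring one.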
<table>
  <tr>
    <th>Model</th>
    <th>Type</th>
    <th>Size</th>
    <th>Capabilities</th>
    <th>Use Cases</th>
    <th>Download</th>
  </tr>
  <tr>
    <td><strong>ReasonFlux-PRM</strong></td>
    <td>PRM</td>
    <td>7B</td>
    <td>• Trajectory-aware scoring<br/>• Online/offline supervision<br/>• Dense process rewards</td>
    <td>Data selection, RL training, test-time scaling</td>
    <td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-7B">🤗 7B</a></td>
  </tr>
  <tr>
    <td><strong>ReasonFlux-PRM</strong></td>
    <td>PRM</td>
    <td>1.5B</td>
    <td>• Lightweight scoring<br/>• Efficient inference<br/>• Edge deployment</td>
    <td>Resource-constrained applications</td>
    <td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-1.5B">🤗 1.5B</a></td>
  </tr>
  <tr>
    <td><strong>ReasonFlux-PRM-Qwen-2.5</strong></td>
    <td>End-to-End Trained Policy Model</td>
    <td>7B</td>
    <td>• Long CoT reasoning<br/>• Solving complex tasks and problems</td>
    <td>Math and science reasoning</td>
    <td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B">🤗 7B</a></td>
  </tr>
</table>
> *Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: we first apply SFT on 1k Trajectory–Response pairs selected by ReasonFlux-PRM-7B, then perform RL training with GRPO using reward signals from ReasonFlux-PRM-7B.*
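The PRM-in-the-loop GRPO step in the note above can be sketched as group-relative advantage estimation over dense process rewards. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the per-step reward values are fabricated for illustration, and the mean aggregation of step rewards into a trajectory-level return is a simplifying assumption.

```python
# Group-relative advantages in GRPO, driven by dense process-level
# rewards (as a PRM would provide per reasoning step). Sketch only.

from statistics import mean, pstdev

def grpo_advantages(group_step_rewards: list[list[float]]) -> list[float]:
    """For each rollout in a group sampled from the same prompt:
    aggregate its dense step rewards into a scalar return, then
    normalize by the group mean and standard deviation."""
    returns = [mean(steps) for steps in group_step_rewards]  # trajectory return
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + 1e-8) for r in returns]      # group-relative advantage

# Four rollouts for one prompt, each with illustrative per-step PRM rewards:
group = [
    [0.9, 0.8, 0.95],   # mostly sound reasoning steps
    [0.2, 0.1, 0.3],    # flawed trajectory
    [0.6, 0.7, 0.5],
    [0.4, 0.5, 0.45],
]
adv = grpo_advantages(group)
```

Rollouts whose PRM-scored steps beat the group average receive positive advantages and are reinforced; the rest are penalized, with no separate value network needed.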

## Citation
```bibtex
@article{zou2025reasonfluxprm,
  title={ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs},
  author={Zou, Jiaru and Yang, Ling and Gu, Jingwen and Qiu, Jiahao and Shen, Ke and He, Jingrui and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.18896},
  year={2025}
}
```