---
license: mit
library_name: transformers
---


# ReasonFlux-PRM

[Code](https://github.com/Gen-Verse/ReasonFlux) | [Paper](https://arxiv.org/abs/2506.18896)

We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) designed to evaluate trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. It supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.
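As an illustration of the idea of combining step-level and trajectory-level supervision, the toy sketch below blends per-step process rewards with a whole-trajectory reward into one score. The aggregation function and the weighting `alpha` are assumptions for illustration only, not the released model's API or the paper's exact formulation.

```python
def aggregate_reward(step_rewards, trajectory_reward, alpha=0.5):
    """Blend per-step process rewards with a whole-trajectory reward.

    step_rewards: list of floats, one reward per reasoning step.
    trajectory_reward: float, a single score for the full trajectory.
    alpha: illustrative mixing weight (an assumption, not from the paper).
    """
    if not step_rewards:
        return trajectory_reward
    # Mean over step-level rewards gives a dense process-level signal.
    step_score = sum(step_rewards) / len(step_rewards)
    # Convex combination of step-level and trajectory-level signals.
    return alpha * step_score + (1.0 - alpha) * trajectory_reward

# Example: a 3-step reasoning trace scored per step, plus a trajectory score.
score = aggregate_reward([0.9, 0.7, 0.8], trajectory_reward=0.6, alpha=0.5)
```

In practice, such a combined score can rank candidate trajectory-response pairs for data selection or serve as a dense reward during RL.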

<table>
<tr>
<th>Model</th>
<th>Type</th>
<th>Size</th>
<th>Capabilities</th>
<th>Use Cases</th>
<th>Download</th>
</tr>
<tr>
<td><strong>ReasonFlux-PRM</strong></td>
<td>PRM</td>
<td>7B</td>
<td>• Trajectory-aware scoring<br/>• Online/Offline supervision<br/>• Dense process rewards</td>
<td>Data selection, RL training, Test-time scaling</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-7B">🤗 7B</a></td>
</tr>
<tr>
<td><strong>ReasonFlux-PRM</strong></td>
<td>PRM</td>
<td>1.5B</td>
<td>• Lightweight scoring<br/>• Efficient inference<br/>• Edge deployment</td>
<td>Resource-constrained applications</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-1.5B">🤗 1.5B</a></td>
</tr>
<tr>
<td><strong>ReasonFlux-PRM-Qwen-2.5</strong></td>
<td>End-to-End Trained Policy Model</td>
<td>7B</td>
<td>• Long CoT reasoning <br/>• Solving complex tasks and problems</td>
<td>Math and Science Reasoning</td>
<td><a href="https://huggingface.co/Gen-Verse/ReasonFlux-PRM-Qwen-2.5-7B">🤗 7B</a></td>
</tr>
</table>

> *Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: we first apply SFT on 1k Trajectory–Response pairs selected by ReasonFlux-PRM-7B, then run RL training using GRPO with ReasonFlux-PRM-7B providing dense process rewards.*

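GRPO normalizes each sampled response's reward relative to its sampling group; the sketch below shows that group-relative advantage computation, which is where a PRM's dense rewards would enter. This is an illustrative snippet, not the released training code.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each response's reward within its group.

    rewards: list of scalar rewards (e.g., PRM scores) for a group of
    responses sampled from the same prompt.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    # Mean-centered, unit-variance advantages within the group.
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled responses to one prompt, scored by a PRM.
adv = group_relative_advantages([0.9, 0.5, 0.7, 0.3])
```

Responses scored above the group mean receive positive advantages and are reinforced; those below the mean are penalized.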
## Citation

```bibtex
@article{zou2025reasonfluxprm,
  title={ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs},
  author={Zou, Jiaru and Yang, Ling and Gu, Jingwen and Qiu, Jiahao and Shen, Ke and He, Jingrui and Wang, Mengdi},
  journal={arXiv preprint arXiv:2506.18896},
  year={2025}
}
```