---
license: mit
datasets:
- GenPRM/GenPRM-MATH-Data
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

# Introduction

We propose **GenPRM**, a strong generative process reward model with the following features:

- performing explicit **CoT reasoning** and **code verification** before providing the process judgment;
- improving on Monte Carlo estimation and hard labels with **Relative Progress Estimation (RPE)**;
- supporting GenPRM **test-time scaling** in a parallel manner with majority voting;
- supporting policy model test-time scaling with GenPRM as a **verifier** or **critic**.
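
The parallel test-time scaling above amounts to majority voting over N independently sampled judgments for the same solution step. The helper below is a minimal, hypothetical sketch of that aggregation (the actual pipeline lives in the GitHub repo):

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate N sampled process judgments for one step by majority voting.

    `judgments` holds one boolean per sampled GenPRM rationale
    (True = step judged correct). Returns the majority decision.
    """
    return Counter(judgments).most_common(1)[0][0]

# Hypothetical example: five sampled rationales for the same step.
print(majority_vote([True, True, False, True, False]))  # True
```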

GenPRM achieves state-of-the-art performance across multiple benchmarks in two key roles:

- **As a verifier**: GenPRM-7B outperforms all classification-based PRMs of comparable size and even surpasses **Qwen2.5-Math-PRM-72B** via test-time scaling.
- **As a critic**: GenPRM-7B demonstrates superior critique capabilities, achieving **3.4×** greater performance gains than DeepSeek-R1-Distill-Qwen-7B after 3 refinement iterations.
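
As a concrete picture of the verifier role, Best-of-N selection scores each of N candidate solutions and keeps the best one. The sketch below uses hypothetical per-step scores as stand-ins for GenPRM's generative judgments; aggregating a solution's step scores by their minimum is one common choice for PRMs:

```python
def best_of_n(candidates, step_scores):
    """Pick the candidate solution whose weakest step scores highest.

    `candidates` is a list of solution strings; `step_scores[i]` holds
    per-step scores in [0, 1] for candidate i (hypothetical stand-ins for
    GenPRM's judgments). A solution's score is its minimum step score.
    """
    best = max(range(len(candidates)), key=lambda i: min(step_scores[i]))
    return candidates[best]

candidates = ["solution A", "solution B"]
step_scores = [[0.9, 0.2], [0.7, 0.8]]
print(best_of_n(candidates, step_scores))  # solution B: its weakest step is stronger
```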

![](images/fig_head.png)

- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM)
- Paper: [https://arxiv.org/abs/2504.00891](https://arxiv.org/abs/2504.00891)
- Code: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)
- Awesome Process Reward Models: [Awesome Process Reward Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models)
- HF Paper Link: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://hf.co/papers/2504.00891)
- HF Collection: [GenPRM](https://hf.co/collections/GenPRM/genprm-67ee4936234ba5dd16bb9943)

# Model details

For full training details, please refer to our [paper](https://arxiv.org/abs/2504.00891).

- Training data: 23K SFT examples, released in [GenPRM-MATH-Data](https://huggingface.co/datasets/GenPRM/GenPRM-MATH-Data).
- Base model: we use the [DeepSeek-R1-Distill series](https://huggingface.co/deepseek-ai) (1.5B, 7B, and 32B) as our base models.

# How to use

The evaluation code of GenPRM is available in our GitHub repository: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM).

Here is a minimal example of using GenPRM for rationale generation and process supervision:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load model and tokenizer
model = LLM(model="GenPRM/GenPRM-32B")
tokenizer = AutoTokenizer.from_pretrained("GenPRM/GenPRM-32B")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
    top_k=20,
    repetition_penalty=1.0
)

# Define the messages
messages = [
    {'role': 'system', 'content': 'You are a math teacher. Your task is to review and critique the paragraphs in solution step by step.'},
    {'role': 'user', 'content': 'Question: Let $f(x)=x^2-7x+18$ and let $g(f(x))=2x+3$. What is the sum of all possible values of $g(8)$?\n\nTo solve the problem, we need to first understand the given functions and how they interact with each other. We are given $f(x) = x^2 - 7x + 18$ and $g(f(x)) = 2x + 3$.'}
]

# Generate the prompt and get the model's output
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(prompt, sampling_params)

# Print result
print(f"Model output for the first solution step: {outputs[0].outputs[0].text}")
```
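
The generated text contains the CoT rationale, any code verification, and a final judgment for the step. How that judgment is encoded is defined by the training data and the GitHub repo; purely as an illustration, the parser below assumes a boxed `Yes`/`No` verdict at the end of the rationale (a hypothetical format, so check the repo for the real one):

```python
import re

def extract_judgment(text):
    """Extract the final step judgment from a generated rationale.

    Assumes (hypothetically) that the rationale ends with a boxed verdict
    such as \\boxed{Yes} or \\boxed{No}. Returns True, False, or None
    when no verdict is found.
    """
    matches = re.findall(r"\\boxed\{(Yes|No)\}", text)
    if not matches:
        return None
    return matches[-1] == "Yes"

sample_output = "The step correctly restates f and g. Verification passed. \\boxed{Yes}"
print(extract_judgment(sample_output))  # True
```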

# Citation

If you find this work helpful, please kindly cite our paper:

```bibtex
@article{zhao2025genprm,
    title   = {GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning},
    author  = {Jian Zhao and Runze Liu and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
    journal = {arXiv preprint arXiv:2504.00891},
    year    = {2025}
}
```

Our collection of PRMs in [Awesome-Process-Reward-Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models):

```bibtex
@misc{Awesome-Process-Reward-Models,
    title        = {Awesome Process Reward Models},
    author       = {Runze Liu and Jian Zhao and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
    howpublished = {\url{https://github.com/RyanLiu112/Awesome-Process-Reward-Models}},
    note         = {GitHub repository},
    year         = {2025}
}
```

Our recent work on LLM test-time scaling with PRMs:

```bibtex
@article{liu2025can,
    title   = {Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling},
    author  = {Runze Liu and Junqi Gao and Jian Zhao and Kaiyan Zhang and Xiu Li and Biqing Qi and Wanli Ouyang and Bowen Zhou},
    journal = {arXiv preprint arXiv:2502.06703},
    year    = {2025}
}
```