---
license: mit
datasets:
- GenPRM/GenPRM-MATH-Data
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
---

# Introduction

We introduce GenPRM, a generative process reward model (PRM) designed to enhance process supervision through explicit Chain-of-Thought (CoT) reasoning and code verification. Addressing critical limitations of prior PRMs, including limited process supervision and poor scalability, GenPRM pioneers a paradigm that leverages the generative capabilities of LLMs to perform step-wise reasoning validation.

GenPRM achieves state-of-the-art performance across multiple benchmarks in two key roles:

- As a verifier: GenPRM-7B outperforms all classification-based PRMs of comparable size and even surpasses Qwen2.5-Math-PRM-72B via test-time scaling; a sketch of this usage follows this list.
- As a critic: GenPRM-7B demonstrates superior critique capabilities, achieving 3.4× greater performance gains than DeepSeek-R1-Distill-Qwen-7B after 3 refinement iterations.
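
To make the verifier role concrete, here is a minimal sketch of Best-of-N selection with a generative PRM: sample several candidate solutions, score each step with GenPRM, and keep the candidate with the highest average step reward. Both helpers (`generate_candidates` and `genprm_step_reward`) are hypothetical placeholders, not part of the released code; the actual test-time scaling pipeline lives in the GitHub repository.

```python
from statistics import mean

def best_of_n(question, generate_candidates, genprm_step_reward, n=8):
    """Pick the candidate solution whose steps GenPRM scores highest on average.

    Both callables are hypothetical placeholders for this sketch:
    generate_candidates(question, n) -> list of solutions (each a list of step strings)
    genprm_step_reward(question, solution, i) -> float reward for step i
    """
    candidates = generate_candidates(question, n)

    def avg_reward(solution):
        # Average the step-level rewards extracted from GenPRM's critique of each step
        return mean(genprm_step_reward(question, solution, i) for i in range(len(solution)))

    return max(candidates, key=avg_reward)
```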

![GenPRM](./figure.png)

- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM/)
- Paper: [https://arxiv.org/abs/2504.00891](https://arxiv.org/abs/2504.00891)
- Code: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)

# Model details

For full training details, please refer to our [paper](https://arxiv.org/abs/2504.00891).

- Training data: the 23K conversations are released in [GenPRM-MATH-Data](https://huggingface.co/datasets/GenPRM/GenPRM-MATH-Data); a loading sketch follows this list.
- Base model: we select the [DeepSeek-R1-Distill series](https://huggingface.co/deepseek-ai) (1.5B, 7B, 32B) as our base models.
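
To inspect the training data, the dataset can be pulled with the Hugging Face `datasets` library. The snippet below assumes the default `train` split and makes no assumption about field names; check the dataset card for the exact schema.

```python
from datasets import load_dataset

# Download the released GenPRM training conversations from the Hub
data = load_dataset("GenPRM/GenPRM-MATH-Data", split="train")

print(len(data))  # number of conversations
print(data[0])    # first record; field names are defined by the dataset card
```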

# How to use

The evaluation and testing code for GenPRM is available in our GitHub repository: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)

Here is a minimal example of using vLLM for GenPRM rationale generation:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load model and tokenizer
model = LLM(model="GenPRM/GenPRM-7B")
tokenizer = AutoTokenizer.from_pretrained("GenPRM/GenPRM-7B")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=8192,
    top_k=20,
    repetition_penalty=1.0
)

# Define the messages: the system prompt sets the critic role, and the user turn
# carries the question plus the first solution step to be reviewed
messages = [
    {'role': 'system', 'content': 'You are a math teacher. Your task is to review and critique the paragraphs in solution step by step.'},
    {'role': 'user', 'content': 'Question: Let $f(x)=x^2-7x+18$ and let $g(f(x))=2x+3$. What is the sum of all possible values of $g(8)$?\n\nTo solve the problem, we need to first understand the given functions and how they interact with each other. We are given $f(x) = x^2 - 7x + 18$ and $g(f(x)) = 2x + 3$.'}
]

# Render the chat template and generate the critique
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(prompt, sampling_params)

# Print result
print(f"Model output for the first solution step: {outputs[0].outputs[0].text}")
```
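
The model's output is a natural-language critique rather than a scalar, so a step reward has to be parsed out of the generated text. The sketch below, continuing from the snippet above, is illustrative only: it assumes the critique closes with a `\boxed{Yes}`/`\boxed{No}` verdict, which may not match the released template. The official reward extraction is in the GitHub repository.

```python
import re

def extract_step_reward(critique: str) -> float:
    r"""Turn a generated critique into a binary step reward.

    Assumes the critique ends with a \boxed{Yes}/\boxed{No} verdict;
    adjust the pattern to the actual output template of the released model.
    """
    match = re.search(r"\\boxed\{(Yes|No)\}", critique)
    if match is None:
        return 0.0  # no parseable verdict; treat the step as unverified
    return 1.0 if match.group(1) == "Yes" else 0.0

# Continuing from the generation example above:
reward = extract_step_reward(outputs[0].outputs[0].text)
print(f"Step reward: {reward}")
```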

# Citation

If you find this work helpful, please kindly cite our paper:

```bibtex
@article{zhao2025genprm,
    title   = {GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning},
    author  = {Jian Zhao and Runze Liu and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
    journal = {arXiv preprint arXiv:2504.00891},
    year    = {2025}
}
```

Our collection of PRMs in [Awesome-Process-Reward-Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models):

```bibtex
@misc{Awesome-Process-Reward-Models,
    title        = {Awesome Process Reward Models},
    author       = {Runze Liu and Jian Zhao and Kaiyan Zhang and Zhimu Zhou and Junqi Gao and Dong Li and Jiafei Lyu and Zhouyi Qian and Biqing Qi and Xiu Li and Bowen Zhou},
    howpublished = {\url{https://github.com/RyanLiu112/Awesome-Process-Reward-Models}},
    note         = {GitHub repository},
    year         = {2025}
}
```

Our recent work on LLM test-time scaling with PRMs:

```bibtex
@article{liu2025can,
    title   = {Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling},
    author  = {Runze Liu and Junqi Gao and Jian Zhao and Kaiyan Zhang and Xiu Li and Biqing Qi and Wanli Ouyang and Bowen Zhou},
    journal = {arXiv preprint arXiv:2502.06703},
    year    = {2025}
}
```