File size: 5,714 Bytes
ba15109 b1687f2 ba15109 cbdb08d ba15109 cbdb08d ba15109 593f8de ba15109 cbdb08d ba15109 cbdb08d ba15109 cbdb08d ba15109 b1687f2 ba15109 cbdb08d b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 cbdb08d b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 ba15109 b1687f2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | ---
library_name: transformers
tags:
- reward-model
- prm
- generative reward model
- process supervision
- chain-of-thought
- verification
- math reasoning
- code verification
---
# Model Card for ThinkPRM-1.5B
ThinkPRM-1.5B is a generative Process Reward Model (PRM) based on the R1-Distill-Qwen-1.5B architecture. It is fine-tuned to perform step-by-step verification of reasoning processes (like mathematical solutions) by generating an explicit verification chain-of-thought (CoT) that involves labeling every step. It is designed to be highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs while achieving strong performance.
Here's an example of the model output:
## Model Details
### Model Description
ThinkPRM-1.5B provides step-level verification scores by generating natural language critiques and correctness judgments for each step in a given solution prefix. It leverages the underlying reasoning capabilities of the base Large Reasoning Model (LRM) and enhances them through fine-tuning on a small (1K examples) dataset of synthetically generated verification CoTs. These synthetic CoTs were produced by prompting QwQ-32B-Preview and filtered against ground-truth step labels from the PRM800K dataset to ensure quality.
The model uses a standard language modeling objective, making it interpretable and allowing it to scale process verification compute by generating longer or multiple verification CoTs. It demonstrated superior performance compared to LLM-as-a-judge and discriminative PRM baselines (based on the same R1-Distill-Qwen-1.5B model but trained on ~100x more labels) on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
- **Finetuned from model [optional]:** [R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
### Model Sources [optional]
- **Repository:** [Github](https://github.com/mukhal/thinkprm)
- **Paper:** [Process Reward Models that Think (arXiv:2504.16828)](https://arxiv.org/abs/2504.16828)
### Direct Use
ThinkPRM-1.5B is intended for verifying the correctness of step-by-step reasoning processes. Primary uses include:
- **Scoring Solutions:** Assigning step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search in reasoning tasks.
- **Generating Verification Rationales/CoTs:** Producing detailed chain-of-thought verifications that explain *why* a particular step is correct or incorrect, aiding interpretability.
- **Standalone Verification:** Evaluating the correctness of a given problem-solution pair.
The model has been evaluated on mathematical reasoning (MATH, AIME), scientific QA (GPQA), and code generation (LiveCodeBench). See our paper for more details.
## Limitations
- **Overconfidence:** Generative PRMs like ThinkPRM can sometimes produce scores clustered near 0 or 1, potentially not reflecting true uncertainty
- **Step Label Interference:** The autoregressive nature might cause an early incorrect step judgment to negatively bias the evaluation of subsequent steps.
- **Sensitivity to Formatting/Prompting:** Performance might be sensitive to the exact format of the input solution and the prompt used for verification (though fine-tuning likely reduces this compared to LLM-as-a-judge).
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams
model_id = "launch/ThinkPRM-1.5B" # Replace with actual model ID on Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=16384)
# Example problem and solution
problem = "Solve for x: 2x + 3 = 7"
prefix = "Step 1: Subtract 3 from both sides: 2x = 4\nStep 2: Divide by 2: x = 2"
# Format the prompt -- this is the prompt used for the results in the paper.
prompt = f"""You are given a math problem and a proposed step-by-step solution:
[Math Problem]
{problem}
[Solution]
{prefix}
Review and critique each step in the proposed solution to determine whether each step is correct. If the solution is incomplete, only verify the provided steps
"""
prompt = tokenizer.apply_chat_template([
{'role': "user", "content": prompt}
], tokenize=False, add_generation_prompt=True) # construct prompt
# Set sampling parameters
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=4096,
stop=None
)
# Generate the verification
outputs = llm.generate(prompt, sampling_params)
verification_cot = outputs[0].outputs[0].text
print(verification_cot)
# Alright, I've got this math problem to solve: 2x + 3 = 7. And there's a proposed solution in multiple steps that I need to review and critique.
# ...
# Yes, that checks out. So, the solution is correct.
# Alternatively, I could think about it differently. Maybe I could first divide both sides by 2, but that would give me x + 1.5 = 3.5, and then subtract 1.5 from both sides to get x = 2. But that's more complicated than the given solution.
# The given solution is straightforward and efficient.
# So, in conclusion, both steps are correct, and the solution is accurate.
# But according to the instructions, since both steps are correct, I don't need to go into more detail. I just need to state that each step is correct.
# So, my output should be:
# Step 1: The step is to subtract 3 from both sides, resulting in 2x = 4. This is a correct operation to isolate the term with x. The step is \boxed{correct}
#
## Step 2: The step is to divide both sides by 2, resulting in x = 2. This is the correct operation to solve for x. The step is \boxed{correct}
# </think>
# Is the solution correct? Yes |