---
library_name: transformers
tags:
- reward-model
- prm
- generative reward model
- process supervision
- chain-of-thought
- verification
- math reasoning
- code verification
---

# Model Card for ThinkPRM-1.5B

ThinkPRM-1.5B is a generative Process Reward Model (PRM) based on the R1-Distill-Qwen-1.5B architecture. It is fine-tuned to verify reasoning processes (such as mathematical solutions) step by step, generating an explicit verification chain-of-thought (CoT) that labels every step. It is designed to be highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs while achieving strong performance.


## Model Details

### Model Description

ThinkPRM-1.5B provides step-level verification scores by generating natural language critiques and correctness judgments for each step in a given solution prefix. It builds on the reasoning capabilities of the base Large Reasoning Model (LRM) and enhances them through fine-tuning on a small dataset (1K examples) of synthetically generated verification CoTs. These synthetic CoTs were produced by prompting QwQ-32B-Preview and were filtered against ground-truth step labels from the PRM800K dataset to ensure quality.
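
The per-step judgments can be read back out of a generated verification CoT. The sketch below assumes the label format visible in the sample output in this card (one `\boxed{correct}` per step, followed by `</think>`). The negative label `\boxed{incorrect}` and the helper names `extract_step_labels` and `first_error_index` are assumptions made for illustration, not part of the released code.

```python
import re
from typing import List, Optional

# Assumed label format, taken from the sample output in this card.
# "\boxed{incorrect}" as the negative label is an assumption.
STEP_LABEL = re.compile(r"\\boxed\{(correct|incorrect)\}")


def extract_step_labels(verification_cot: str) -> List[bool]:
    """Return one boolean per labeled step, in order of appearance."""
    # Only consider text before the closing </think> tag, if there is one.
    body = verification_cot.split("</think>")[0]
    return [m.group(1) == "correct" for m in STEP_LABEL.finditer(body)]


def first_error_index(labels: List[bool]) -> Optional[int]:
    """Index of the first step judged incorrect, or None if all are correct."""
    for i, ok in enumerate(labels):
        if not ok:
            return i
    return None


# Hand-written sample text, not real model output.
sample = (
    "Step 1: ... The step is \\boxed{correct}\n"
    "Step 2: ... The step is \\boxed{incorrect}\n"
    "</think>\nIs the solution correct? No"
)
labels = extract_step_labels(sample)
print(labels, first_error_index(labels))
```

If the model mentions a boxed label while still deliberating, this simple pattern would count it too, so a stricter parser may be needed in practice.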

The model is trained with a standard language modeling objective, so its judgments come with readable rationales, and verification compute can be scaled by generating longer or multiple verification CoTs. It outperformed LLM-as-a-judge and discriminative PRM baselines (built on the same R1-Distill-Qwen-1.5B model but trained on ~100x more labels) on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
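
One simple way to use multiple verification CoTs is to sample several for the same solution and average a per-CoT score. The sketch below is an illustration under stated assumptions, not necessarily the scoring used in the paper: it reads a hard 0/1 score from the final `Is the solution correct? Yes` line seen in the sample output in this card, and it treats a missing verdict as a rejection. The names `verdict_score` and `aggregate_score` are hypothetical.

```python
from statistics import mean
from typing import Sequence

VERDICT_PROMPT = "Is the solution correct?"


def verdict_score(verification_cot: str) -> float:
    """Map one verification CoT to a hard score: 1.0 for a final "Yes", else 0.0."""
    # Take whatever follows the last occurrence of the verdict question.
    _, found, tail = verification_cot.rpartition(VERDICT_PROMPT)
    if not found:
        return 0.0  # assumption: a missing verdict counts as a rejection
    return 1.0 if tail.strip().lower().startswith("yes") else 0.0


def aggregate_score(verification_cots: Sequence[str]) -> float:
    """Average the hard scores of several verification CoTs for one solution."""
    if not verification_cots:
        raise ValueError("need at least one verification CoT")
    return mean(verdict_score(cot) for cot in verification_cots)


# Hand-written stand-ins for sampled verification CoTs.
cots = [
    "... </think>\nIs the solution correct? Yes",
    "... </think>\nIs the solution correct? No",
    "... </think>\nIs the solution correct? Yes",
]
print(aggregate_score(cots))
```

A soft score (for example, one based on token probabilities of the verdict) could replace `verdict_score` without changing the aggregation step.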

- **Finetuned from model:** [R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)

### Model Sources

- **Repository:** [Github](https://github.com/mukhal/thinkprm)
- **Paper:** [Process Reward Models that Think (arXiv:2504.16828)](https://arxiv.org/abs/2504.16828)


### Direct Use

ThinkPRM-1.5B is intended for verifying the correctness of step-by-step reasoning processes. Primary uses include:
- **Scoring Solutions:** Assigning step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search in reasoning tasks.
- **Generating Verification Rationales/CoTs:** Producing detailed chain-of-thought verifications that explain *why* a particular step is correct or incorrect, aiding interpretability.
- **Standalone Verification:** Evaluating the correctness of a given problem-solution pair.

The model has been evaluated on mathematical reasoning (MATH, AIME), scientific QA (GPQA), and code generation (LiveCodeBench). See our paper for more details.
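
For the Best-of-N use listed above, the selection step itself is small once a scoring function exists. The sketch below is a minimal illustration: `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a placeholder standing in for a real call to the verifier.

```python
from typing import Callable, Sequence


def best_of_n(
    problem: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda solution: score_fn(problem, solution))


def toy_score(problem: str, solution: str) -> float:
    """Placeholder scorer; a real one would query the PRM."""
    return 1.0 if "x = 2" in solution else 0.0


problem = "Solve for x: 2x + 3 = 7"
candidates = [
    "Step 1: Add 3 to both sides: 2x = 10\nStep 2: Divide by 2: x = 5",
    "Step 1: Subtract 3 from both sides: 2x = 4\nStep 2: Divide by 2: x = 2",
]
best = best_of_n(problem, candidates, toy_score)
print(best)
```

Passing the scorer as a function keeps the selection logic independent of how scores are produced, so the same helper works with step-level or overall scores.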

## Limitations

- **Overconfidence:** Generative PRMs like ThinkPRM can sometimes produce scores clustered near 0 or 1, which may not reflect true uncertainty.
- **Step Label Interference:** The autoregressive nature might cause an early incorrect step judgment to negatively bias the evaluation of subsequent steps.
- **Sensitivity to Formatting/Prompting:** Performance might be sensitive to the exact format of the input solution and the prompt used for verification (though fine-tuning likely reduces this compared to LLM-as-a-judge).

## How to Get Started with the Model
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "launch/ThinkPRM-1.5B"  # model ID on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=16384)

# Example problem and solution
problem = "Solve for x: 2x + 3 = 7"
prefix = "Step 1: Subtract 3 from both sides: 2x = 4\nStep 2: Divide by 2: x = 2"

# Format the prompt -- this is the prompt used for the results in the paper.
prompt = f"""You are given a math problem and a proposed step-by-step solution:

[Math Problem]

{problem}

[Solution]

{prefix}

Review and critique each step in the proposed solution to determine whether each step is correct. If the solution is incomplete, only verify the provided steps
"""

prompt = tokenizer.apply_chat_template([
    {'role': "user", "content": prompt}
], tokenize=False, add_generation_prompt=True) # construct prompt 

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
    stop=None
)

# Generate the verification
outputs = llm.generate(prompt, sampling_params)
verification_cot = outputs[0].outputs[0].text

print(verification_cot)
# Alright, I've got this math problem to solve: 2x + 3 = 7. And there's a proposed solution in multiple steps that I need to review and critique.
# ...
# Yes, that checks out. So, the solution is correct.

# Alternatively, I could think about it differently. Maybe I could first divide both sides by 2, but that would give me x + 1.5 = 3.5, and then subtract 1.5 from both sides to get x = 2. But that's more complicated than the given solution.

# The given solution is straightforward and efficient.

# So, in conclusion, both steps are correct, and the solution is accurate.

# But according to the instructions, since both steps are correct, I don't need to go into more detail. I just need to state that each step is correct.

# So, my output should be:

# Step 1: The step is to subtract 3 from both sides, resulting in 2x = 4. This is a correct operation to isolate the term with x. The step is \boxed{correct}
#
# Step 2: The step is to divide both sides by 2, resulting in x = 2. This is the correct operation to solve for x. The step is \boxed{correct}

# </think>
# Is the solution correct? Yes