---
base_model: Qwen/Qwen2.5-Math-1.5B-Instruct
library_name: transformers
model_name: Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM
tags:
- generated_from_trainer
- prm
- trl
- math
- process-reward-model
- qwen2.5
- sharp
---

# Model Card for Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM

## Introduction

**Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM** is a Process Reward Model (PRM) fine-tuned from [Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct). This model is specifically designed to evaluate the correctness of intermediate reasoning steps in mathematical problem-solving processes, enabling more reliable and interpretable mathematical reasoning.

The model was trained on the **SHARP-Math** dataset using the Process Reward Model methodology, which provides step-level feedback on mathematical reasoning chains. It is part of the SHARP-PRM series of models.

## Model Information

### Base Model
- **Base Model**: [Qwen/Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct)
- **Architecture**: Qwen2ForTokenClassification
- **Parameters**: 1.5B

### Training Details
- **Training Dataset**: SHARP-Math (Process Reward Model dataset)
- **Training Method**: Process Reward Model (PRM) as introduced in [Uesato et al., 2022](https://huggingface.co/papers/2211.14275)
- **Training Framework**: [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl) v0.24.0
- **Task Type**: Token Classification (binary classification: error/correct for each reasoning step)

## PRM Evaluation

This model is designed to evaluate mathematical reasoning processes by:
1. **Step-level Evaluation**: Classifying each step in a reasoning chain as either "correct" or "error"
2. **Process Feedback**: Providing feedback on the reasoning process, not just the final answer
3. **Error Detection**: Identifying where mistakes occur in multi-step mathematical solutions

### Evaluation Metrics
The model is evaluated on the [ProcessBench](https://huggingface.co/datasets/Qwen/ProcessBench) benchmark.

Key metrics include:
- **Error Accuracy**: Ability to correctly identify incorrect steps
- **Correct Accuracy**: Ability to correctly identify correct steps
- **F1 Score**: Balanced measure of error and correct step classification
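For intuition, the F1 score here is the harmonic mean of the two accuracies, which is how ProcessBench reports its headline number. A minimal sketch (the example accuracies are illustrative, not measured results for this model):

```python
# Sketch: combining step-level error/correct accuracies into an F1 score,
# following the ProcessBench convention of a harmonic mean.
def processbench_f1(error_acc: float, correct_acc: float) -> float:
    """Harmonic mean of accuracy on erroneous and on fully correct samples."""
    if error_acc + correct_acc == 0:
        return 0.0
    return 2 * error_acc * correct_acc / (error_acc + correct_acc)

# Illustrative values only:
print(f"{processbench_f1(0.80, 0.60):.4f}")
```

The harmonic mean penalizes a model that only does well on one of the two subsets, so a high F1 requires balanced error detection and correct-step recognition.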

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

#### Using the Model for Step Classification

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: Evaluate a mathematical reasoning chain
# Problem with steps (one correct, one incorrect)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # Correct step
    "Divide by 2: x = 5"  # Incorrect step (should be x = 4)
]

# Format input with step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [batch_size, sequence_length, num_labels]
    probabilities = F.softmax(logits, dim=-1)  # Convert to probabilities
    predictions = torch.argmax(logits, dim=-1)  # Get predicted class indices

# Aggregate predictions per step by walking token offsets through the
# concatenated input. Note: tokenizing segments separately can drift from
# the joint tokenization at segment boundaries; for exact spans, use the
# tokenizer's offset_mapping to align tokens to step boundaries.
labels = ["error", "correct"]
offset = len(tokenizer(problem + "\n\n", add_special_tokens=False)["input_ids"])
for i, step in enumerate(steps):
    segment = step if i == len(steps) - 1 else step + "\n\n"
    n_tokens = len(tokenizer(segment, add_special_tokens=False)["input_ids"])
    step_preds = predictions[0, offset:offset + n_tokens]
    step_probs = probabilities[0, offset:offset + n_tokens, 1]  # P(correct)
    offset += n_tokens
    step_label = labels[step_preds.mode().values.item()] if len(step_preds) > 0 else "unknown"
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Mean P(correct): {step_probs.mean().item():.2%}")

# Illustrative output (actual labels and probabilities depend on the model):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
```

**Output Interpretation:**

- **Logits**: Raw scores from the model (before softmax). Higher values indicate stronger confidence.
- **Probabilities**: Softmax-normalized scores between 0 and 1. Sum to 1 for each token.
- **Predictions**: Class indices (0 = "error", 1 = "correct") for each token.

#### Using with Pipeline

```python
import torch
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    tokenizer="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    device=0 if torch.cuda.is_available() else -1
)

# Classify reasoning steps (the pipeline returns one prediction per token;
# aggregate to step level as shown in the previous example)
result = classifier(problem + "\n\n" + "\n\n".join(steps))
```

### Integration with Mathematical Reasoning

This PRM model can be used to:
1. **Filter incorrect reasoning paths** in tree-of-thought or chain-of-thought generation
2. **Provide feedback** during step-by-step problem solving
3. **Evaluate solution quality** before final answer generation
4. **Improve training** by identifying problematic reasoning patterns
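The first use case can be sketched as a simple best-of-N reranker: score each candidate solution by its weakest step and keep the strongest candidate. Here `candidates` holds hypothetical per-step P(correct) values; in practice they would come from the classification code above.

```python
# Sketch: reranking candidate solutions with PRM step scores.
# The per-step probabilities below are stand-ins for real model outputs.
def solution_score(step_probs: list[float]) -> float:
    # A common aggregation: a chain is only as strong as its weakest step.
    return min(step_probs) if step_probs else 0.0

candidates = {
    "solution_a": [0.95, 0.91, 0.88],  # all steps look sound
    "solution_b": [0.97, 0.35, 0.92],  # one suspicious middle step
}
best = max(candidates, key=lambda k: solution_score(candidates[k]))
print(best)  # → solution_a
```

Min-aggregation is one common choice; mean or product aggregation are alternatives, and which works best depends on the downstream task.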

## Training Procedure

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: Per-device batch size (with gradient accumulation)
- **Epochs**: Multiple epochs with early stopping
- **Optimizer**: AdamW with cosine learning rate schedule
- **Warmup Ratio**: 3%
- **Gradient Clipping**: 5.0
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled for memory efficiency
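The settings above map onto TRL's `PRMConfig`/`PRMTrainer` roughly as follows. This is a hedged sketch, not the actual training script: the dataset path mirrors the placeholder convention used elsewhere in this card, and batch size and epoch count are omitted because exact values are not recorded here.

```python
# Hedged sketch: reproducing the listed hyperparameters with TRL's PRMTrainer.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model_id = "Qwen/Qwen2.5-Math-1.5B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

config = PRMConfig(
    output_dir="Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    learning_rate=2e-5,           # as listed above
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,            # 3% warmup
    max_grad_norm=5.0,            # gradient clipping
    bf16=True,                    # bfloat16 precision
    gradient_checkpointing=True,  # memory efficiency
)

trainer = PRMTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=load_dataset("path/to/SHARP-Math", split="train"),
)
trainer.train()
```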

### Training Framework Versions

- **TRL**: 0.24.0
- **Transformers**: 4.56.2
- **PyTorch**: 2.9.1
- **Datasets**: 4.4.1
- **Tokenizers**: 0.22.1

### Training Data

The model was trained on the **SHARP-Math** dataset, which contains:
- Mathematical problems with step-by-step solutions
- Labeled reasoning steps (correct/error)
- Diverse mathematical domains and difficulty levels

## Use Cases

### 1. Mathematical Reasoning Evaluation
- Evaluate intermediate steps in mathematical problem-solving
- Identify errors in multi-step calculations
- Provide feedback on reasoning quality

### 2. Educational Applications
- Automated grading of mathematical solutions
- Step-by-step feedback for students
- Identification of common error patterns

### 3. Research Applications
- Training better mathematical reasoning models
- Analyzing reasoning patterns
- Improving chain-of-thought generation

## Limitations and Considerations

1. **Domain Specificity**: This model is specifically trained for mathematical reasoning and may not generalize well to other domains
2. **Step Length**: The model is optimized for step-level evaluation with a 256-token context per step
3. **Language**: The model is primarily trained on English mathematical content
4. **False Positives/Negatives**: Like all classification models, it may misclassify some steps

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen2.5-math-1.5b-instruct-sharp-math-prm,
  title={Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM}}
}
```

**Model Card Version**: 1.0  
**Last Updated**: 2025-12-30