---
base_model: Qwen/Qwen2.5-Math-1.5B-Instruct
library_name: transformers
model_name: Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM
tags:
- generated_from_trainer
- prm
- trl
- math
- process-reward-model
- qwen2.5
- sharp
---

# Model Card for Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM

## Introduction

**Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM** is a Process Reward Model (PRM) fine-tuned from [Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct). This model is specifically designed to evaluate the correctness of intermediate reasoning steps in mathematical problem-solving processes, enabling more reliable and interpretable mathematical reasoning.

The model was trained on the **SHARP-Math** dataset using the Process Reward Model methodology, which provides step-level feedback on mathematical reasoning chains. It is part of the SHARP-PRM series of models.

## Model Information

### Base Model
- **Base Model**: [Qwen/Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct)
- **Architecture**: Qwen2ForTokenClassification
- **Parameters**: 1.5B

### Training Details
- **Training Dataset**: SHARP-Math (Process Reward Model dataset)
- **Training Method**: Process Reward Model (PRM) as introduced in [Uesato et al., 2022](https://huggingface.co/papers/2211.14275)
- **Training Framework**: [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl) v0.24.0
- **Task Type**: Token Classification (binary classification: error/correct for each reasoning step)

## PRM Evaluation

This model is designed to evaluate mathematical reasoning processes by:
1. **Step-level Evaluation**: Classifying each step in a reasoning chain as either "correct" or "error"
2. **Process Feedback**: Providing feedback on the reasoning process, not just the final answer
3. **Error Detection**: Identifying where mistakes occur in multi-step mathematical solutions

### Evaluation Metrics
The model is evaluated on the [ProcessBench](https://huggingface.co/datasets/Qwen/ProcessBench) benchmark.

Key metrics include:
- **Error Accuracy**: Ability to correctly identify incorrect steps
- **Correct Accuracy**: Ability to correctly identify correct steps
- **F1 Score**: Balanced measure of error and correct step classification
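For intuition, the F1 score here is the harmonic mean of the two accuracies, which is how ProcessBench reports its headline number. A minimal sketch (the example accuracies are illustrative, not measured results for this model):

```python
# Sketch: combining step-level error/correct accuracies into an F1 score,
# following the ProcessBench convention of a harmonic mean.
def processbench_f1(error_acc: float, correct_acc: float) -> float:
    """Harmonic mean of accuracy on erroneous and on fully correct samples."""
    if error_acc + correct_acc == 0:
        return 0.0
    return 2 * error_acc * correct_acc / (error_acc + correct_acc)

# Illustrative values only:
print(f"{processbench_f1(0.80, 0.60):.4f}")
```

The harmonic mean penalizes a model that only does well on one of the two subsets, so a high F1 requires balanced error detection and correct-step recognition.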

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

#### Using the Model for Step Classification

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: Evaluate a mathematical reasoning chain
# Problem with steps (one correct, one incorrect)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # Correct step
    "Divide by 2: x = 5"  # Incorrect step (should be x = 4)
]

# Format input with step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [batch_size, sequence_length, num_labels]
    probabilities = F.softmax(logits, dim=-1)  # Convert to probabilities
    predictions = torch.argmax(logits, dim=-1)  # Get predicted class indices

# Aggregate predictions per step by walking token offsets through the
# concatenated input. Note: tokenizing segments separately can drift from
# the joint tokenization at segment boundaries; for exact spans, use the
# tokenizer's offset_mapping to align tokens to step boundaries.
labels = ["error", "correct"]
offset = len(tokenizer(problem + "\n\n", add_special_tokens=False)["input_ids"])
for i, step in enumerate(steps):
    segment = step if i == len(steps) - 1 else step + "\n\n"
    n_tokens = len(tokenizer(segment, add_special_tokens=False)["input_ids"])
    step_preds = predictions[0, offset:offset + n_tokens]
    step_probs = probabilities[0, offset:offset + n_tokens, 1]  # P(correct)
    offset += n_tokens
    step_label = labels[step_preds.mode().values.item()] if len(step_preds) > 0 else "unknown"
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Mean P(correct): {step_probs.mean().item():.2%}")

# Illustrative output (actual labels and probabilities depend on the model):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
```

**Output Interpretation:**

- **Logits**: Raw scores from the model (before softmax). Higher values indicate stronger confidence.
- **Probabilities**: Softmax-normalized scores between 0 and 1. Sum to 1 for each token.
- **Predictions**: Class indices (0 = "error", 1 = "correct") for each token.

#### Using with Pipeline

```python
import torch
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    tokenizer="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    device=0 if torch.cuda.is_available() else -1
)

# Classify reasoning steps (the pipeline returns one prediction per token;
# aggregate to step level as shown in the previous example)
result = classifier(problem + "\n\n" + "\n\n".join(steps))
```

### Integration with Mathematical Reasoning

This PRM model can be used to:
1. **Filter incorrect reasoning paths** in tree-of-thought or chain-of-thought generation
2. **Provide feedback** during step-by-step problem solving
3. **Evaluate solution quality** before final answer generation
4. **Improve training** by identifying problematic reasoning patterns
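The first use case can be sketched as a simple best-of-N reranker: score each candidate solution by its weakest step and keep the strongest candidate. Here `candidates` holds hypothetical per-step P(correct) values; in practice they would come from the classification code above.

```python
# Sketch: reranking candidate solutions with PRM step scores.
# The per-step probabilities below are stand-ins for real model outputs.
def solution_score(step_probs: list[float]) -> float:
    # A common aggregation: a chain is only as strong as its weakest step.
    return min(step_probs) if step_probs else 0.0

candidates = {
    "solution_a": [0.95, 0.91, 0.88],  # all steps look sound
    "solution_b": [0.97, 0.35, 0.92],  # one suspicious middle step
}
best = max(candidates, key=lambda k: solution_score(candidates[k]))
print(best)  # → solution_a
```

Min-aggregation is one common choice; mean or product aggregation are alternatives, and which works best depends on the downstream task.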

## Training Procedure

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: Per-device batch size (with gradient accumulation)
- **Epochs**: Multiple epochs with early stopping
- **Optimizer**: AdamW with cosine learning rate schedule
- **Warmup Ratio**: 3%
- **Gradient Clipping**: 5.0
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled for memory efficiency
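The settings above map onto TRL's `PRMConfig`/`PRMTrainer` roughly as follows. This is a hedged sketch, not the actual training script: the dataset path mirrors the placeholder convention used elsewhere in this card, and batch size and epoch count are omitted because exact values are not recorded here.

```python
# Hedged sketch: reproducing the listed hyperparameters with TRL's PRMTrainer.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model_id = "Qwen/Qwen2.5-Math-1.5B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

config = PRMConfig(
    output_dir="Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    learning_rate=2e-5,           # as listed above
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,            # 3% warmup
    max_grad_norm=5.0,            # gradient clipping
    bf16=True,                    # bfloat16 precision
    gradient_checkpointing=True,  # memory efficiency
)

trainer = PRMTrainer(
    model=model,
    args=config,
    processing_class=tokenizer,
    train_dataset=load_dataset("path/to/SHARP-Math", split="train"),
)
trainer.train()
```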

### Training Framework Versions

- **TRL**: 0.24.0
- **Transformers**: 4.56.2
- **PyTorch**: 2.9.1
- **Datasets**: 4.4.1
- **Tokenizers**: 0.22.1

### Training Data

The model was trained on the **SHARP-Math** dataset, which contains:
- Mathematical problems with step-by-step solutions
- Labeled reasoning steps (correct/error)
- Diverse mathematical domains and difficulty levels

## Use Cases

### 1. Mathematical Reasoning Evaluation
- Evaluate intermediate steps in mathematical problem-solving
- Identify errors in multi-step calculations
- Provide feedback on reasoning quality

### 2. Educational Applications
- Automated grading of mathematical solutions
- Step-by-step feedback for students
- Identification of common error patterns

### 3. Research Applications
- Training better mathematical reasoning models
- Analyzing reasoning patterns
- Improving chain-of-thought generation

## Limitations and Considerations

1. **Domain Specificity**: This model is specifically trained for mathematical reasoning and may not generalize well to other domains
2. **Step Length**: The model is optimized for step-level evaluation with a 256-token context per step
3. **Language**: The model is primarily trained on English mathematical content
4. **False Positives/Negatives**: Like all classification models, it may misclassify some steps

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen2.5-math-1.5b-instruct-sharp-math-prm,
  title={Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM}}
}
```

**Model Card Version**: 1.0  
**Last Updated**: 2025-12-30