---
base_model: Qwen/Qwen2.5-Math-7B-Instruct
library_name: transformers
model_name: Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM
tags:
- generated_from_trainer
- prm
- trl
- math
- process-reward-model
- qwen2.5
- sharp
---

# Model Card for Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM
## Introduction

**Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM** is a Process Reward Model (PRM) fine-tuned from [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct). It is designed to evaluate the correctness of intermediate reasoning steps in mathematical problem solving, enabling more reliable and interpretable mathematical reasoning.

The model was trained on the **PRM800K** dataset using the Process Reward Model methodology, which provides step-by-step feedback on mathematical reasoning chains. It is part of the SHARP-PRM series of process reward models.
## Model Information

### Base Model
- **Base Model**: [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Architecture**: Qwen2ForTokenClassification
- **Parameters**: 7B

### Training Details
- **Training Dataset**: PRM800K (a process-supervision dataset with 800K step-level labels)
- **Training Method**: Process Reward Model (PRM) training as introduced in [Uesato et al., 2022](https://huggingface.co/papers/2211.14275)
- **Training Framework**: [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl) v0.24.0
- **Task Type**: Token classification (binary: "error" vs. "correct" for each reasoning step)
## PRM Evaluation

This model evaluates mathematical reasoning processes by:
1. **Step-level Evaluation**: Classifying each step in a reasoning chain as either "correct" or "error"
2. **Process Feedback**: Providing feedback on the reasoning process, not just the final answer
3. **Error Detection**: Identifying where mistakes occur in multi-step mathematical solutions

### Evaluation Metrics
The model is evaluated on the [ProcessBench](https://huggingface.co/datasets/Qwen/ProcessBench) benchmark.

Key metrics include:
- **Error Accuracy**: Ability to correctly identify incorrect steps
- **Correct Accuracy**: Ability to correctly identify correct steps
- **F1 Score**: Balanced measure of error and correct step classification
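ProcessBench combines the two accuracies into an F1 score via their harmonic mean, so a model cannot score well by over-flagging or under-flagging errors. A minimal sketch of that computation, assuming both accuracies are given as fractions in [0, 1]:

```python
def processbench_f1(error_accuracy: float, correct_accuracy: float) -> float:
    """Harmonic mean of accuracy on erroneous and on fully correct solutions."""
    if error_accuracy + correct_accuracy == 0:
        return 0.0
    return 2 * error_accuracy * correct_accuracy / (error_accuracy + correct_accuracy)

# A model that finds 80% of errors but accepts only 60% of correct solutions:
print(round(processbench_f1(0.8, 0.6), 4))  # 0.6857
```

The harmonic mean punishes imbalance: a degenerate model that labels every step as an error gets perfect error accuracy but zero correct accuracy, and therefore an F1 of zero.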
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Basic Usage

#### Using the Model for Step Classification

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "path/to/Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: evaluate a mathematical reasoning chain
# (one correct step, one incorrect step)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # correct step
    "Divide by 2: x = 5",                  # incorrect step (should be x = 4)
]

# Format the input with a step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits                     # [batch_size, seq_len, num_labels]
    probabilities = F.softmax(logits, dim=-1)   # per-token class probabilities
    predictions = torch.argmax(logits, dim=-1)  # per-token class indices

# Aggregate predictions per step. This approximates token-to-step mapping by
# re-tokenizing each segment; for exact boundaries, prefer a fast tokenizer
# with `return_offsets_mapping=True`.
labels = ["error", "correct"]
offset = len(tokenizer(problem + "\n\n")["input_ids"])
for i, step in enumerate(steps):
    n_step_tokens = len(tokenizer(step)["input_ids"])
    step_preds = predictions[0, offset : offset + n_step_tokens]
    step_probs = probabilities[0, offset : offset + n_step_tokens, 1]
    offset += n_step_tokens + len(tokenizer("\n\n")["input_ids"])  # skip separator
    step_label = labels[step_preds.mode().values.item()] if len(step_preds) > 0 else "unknown"
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Mean P(correct): {step_probs.mean().item():.2%}")

# Illustrative output (actual probabilities depend on the trained model):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
```
**Output Interpretation:**

- **Logits**: Raw scores from the model (before softmax); higher values indicate stronger confidence.
- **Probabilities**: Softmax-normalized scores between 0 and 1 that sum to 1 across classes for each token.
- **Predictions**: Per-token class indices (0 = "error", 1 = "correct").
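Rather than averaging over all of a step's tokens as in the simplified example above, PRMs in the Qwen family are commonly read out at the token positions that mark step boundaries. A minimal sketch of locating those positions in a token sequence (`find_step_positions` and the integer separator id are illustrative assumptions, not part of this model's API):

```python
from typing import List

def find_step_positions(input_ids: List[int], sep_id: int) -> List[int]:
    """Return the index of each separator token; the model's prediction at
    these positions can be read as the score for the step preceding them."""
    return [i for i, tok in enumerate(input_ids) if tok == sep_id]

# Toy example with integer token ids and separator id 99:
print(find_step_positions([5, 7, 99, 3, 4, 99], sep_id=99))  # [2, 5]
```

In practice you would obtain `sep_id` via `tokenizer.convert_tokens_to_ids(...)` for whatever separator your formatting uses.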
#### Using with Pipeline

```python
from transformers import pipeline
import torch

classifier = pipeline(
    "token-classification",
    model="path/to/Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM",
    tokenizer="path/to/Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM",
    device=0 if torch.cuda.is_available() else -1,
)

# Classify reasoning steps (`problem` and `steps` as defined above)
result = classifier(problem + "\n\n" + "\n\n".join(steps))
```
### Integration with Mathematical Reasoning

This PRM can be used to:
1. **Filter incorrect reasoning paths** in tree-of-thought or chain-of-thought generation
2. **Provide feedback** during step-by-step problem solving
3. **Evaluate solution quality** before final answer generation
4. **Improve training** by identifying problematic reasoning patterns
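As one illustration of path filtering, sampled candidate solutions can be reranked by the score of their weakest step. `rerank_by_min_step_score` is a hypothetical helper, assuming per-step P(correct) values have already been obtained from the PRM:

```python
from typing import Dict, List

def rerank_by_min_step_score(candidates: Dict[str, List[float]]) -> List[str]:
    """Order candidate solutions by their weakest step score, best first:
    a single likely-wrong step sinks the whole chain."""
    return sorted(candidates, key=lambda c: min(candidates[c]), reverse=True)

# Per-step P(correct) for three hypothetical solution chains
scores = {
    "solution_a": [0.9, 0.2, 0.95],  # one likely-wrong step
    "solution_b": [0.7, 0.8, 0.75],  # uniformly plausible
    "solution_c": [0.6, 0.5, 0.9],
}
print(rerank_by_min_step_score(scores))  # ['solution_b', 'solution_c', 'solution_a']
```

Taking the minimum over steps is one common aggregation choice; products or means of step scores are reasonable alternatives depending on how tolerant the application is of a single weak step.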
## Training Procedure

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: Per-device batching with gradient accumulation
- **Epochs**: Multiple epochs with early stopping
- **Optimizer**: AdamW with cosine learning rate schedule
- **Warmup Ratio**: 3%
- **Gradient Clipping**: 5.0
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled for memory efficiency
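The settings above map naturally onto TRL's PRM training configuration. A hedged sketch follows; the field names come from TRL's `PRMConfig` (a `TrainingArguments` subclass), but the batch sizes are placeholders, not the values used for this model:

```python
from trl import PRMConfig

# Illustrative configuration mirroring the reported hyperparameters.
config = PRMConfig(
    output_dir="qwen2.5-math-prm800k-sharp-prm",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=5.0,
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,   # placeholder, not reported
    gradient_accumulation_steps=8,   # placeholder, not reported
)
```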
### Training Framework Versions

- **TRL**: 0.24.0
- **Transformers**: 4.56.2
- **PyTorch**: 2.9.1
- **Datasets**: 4.4.1
- **Tokenizers**: 0.22.1
### Training Data

The model was trained on the **PRM800K** dataset, which contains:
- Mathematical problems with step-by-step solutions
- Step-level correctness labels (correct/error)
- Diverse mathematical domains and difficulty levels
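For PRM training, TRL expects stepwise-supervision examples: a prompt, the solution split into completion steps, and one boolean label per step. A toy example in that shape (the problem text is illustrative, not drawn from PRM800K):

```python
# One stepwise-supervision example in the shape TRL's PRM training consumes:
example = {
    "prompt": "Solve: 2x + 5 = 13",
    "completions": [
        "Subtract 5 from both sides: 2x = 8",
        "Divide by 2: x = 5",
    ],
    "labels": [True, False],  # second step is wrong (x should be 4)
}

# Every step must carry exactly one label.
assert len(example["completions"]) == len(example["labels"])
print(len(example["completions"]))  # 2
```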
## Use Cases

### 1. Mathematical Reasoning Evaluation
- Evaluate intermediate steps in mathematical problem-solving
- Identify errors in multi-step calculations
- Provide feedback on reasoning quality

### 2. Educational Applications
- Automated grading of mathematical solutions
- Step-by-step feedback for students
- Identification of common error patterns

### 3. Research Applications
- Training better mathematical reasoning models
- Analyzing reasoning patterns
- Improving chain-of-thought generation
## Limitations and Considerations

1. **Domain Specificity**: The model is trained specifically for mathematical reasoning and may not generalize well to other domains
2. **Step Length**: The model is optimized for step-level evaluation with up to 256 tokens of context per step
3. **Language**: The model is trained primarily on English mathematical content
4. **False Positives/Negatives**: Like all classifiers, it may misclassify some steps
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen2.5-math-7b-instruct-prm800k-sharp-prm,
  title={Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-7B-Instruct-PRM800K-SHARP-PRM}}
}
```

**Model Card Version**: 1.0
**Last Updated**: 2025-12-30