---
license: apache-2.0
datasets:
  - openai/gsm8k
language:
  - en
metrics:
  - accuracy
base_model:
  - arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
library_name: transformers
tags:
  - RL
  - GRPO
  - Math
---

# MathReasoner-Mini-1.5b
🚨 We recommend this model for high-school-level math problems. It works best when questions are asked in English. We do not advise using it for other tasks.
📒 Colab notebook for inference
## Introduction

This is a reasoning model built on top of Qwen2.5-Math-1.5B-base. It was trained in **three stages (SFT, DPO, and GRPO)** to progressively improve **mathematical reasoning** with **structured outputs** on the **GSM8K** dataset, a benchmark of school-level math word problems.

## Evaluation (GSM8K Pass@1, Zero-Shot)
| Model | Pass@1 Accuracy |
| --------------------------------------- | ----------- |
| Base Qwen2.5-Math-1.5B | 54% |
| After SFT | 67.5% |
| After SFT + DPO | 70% |
| **After SFT + DPO + GRPO (MathReasoner-Mini-1.5b)** | **~83.7%** |
Evaluation was run on the GSM8K test split with `temperature=0.3`, `top_p=1.0`.
XML structured-output accuracy improved from 71% (Qwen2.5-1.5B-base) to **99%** (MathReasoner-Mini-1.5b).
*MathReasoner's pass@8 math accuracy is **94.1%**, showing there is still headroom for improvement from scaling RL.*
*The accuracies above take the structured output format into account: reasoning must be enclosed in think tags and the numerical answer in answer tags.*
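The format requirement above can be checked mechanically. A minimal sketch, assuming the tags are literally `<think>` and `<answer>` (the helper name is ours, not part of the released code):

```python
import re

def parse_structured_output(text):
    """Extract (reasoning, answer) from a <think>...</think><answer>...</answer>
    response, or return None if the structured format is violated."""
    match = re.fullmatch(
        r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    try:
        float(answer.replace(",", ""))  # the answer must parse as a number
    except ValueError:
        return None
    return reasoning, answer
```

Under this checker, `parse_structured_output("<think>2+2=4</think><answer>4</answer>")` succeeds, while a free-form response returns `None` and scores zero on format accuracy.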
## Training Stages
### **Stage 1 – Supervised Fine-Tuning (SFT)**
Checkpoint: [**arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10**](https://huggingface.co/aryan-kolapkar/Qwen-2.5_1.5b_MATH_GSM8K_SFT10)
* Dataset: curated GSM8K subset with self-verified generations
* Epochs: 10
* LR: 3e-6
* Batch size: 4
* Gradient accumulation: 4
* Only correct & well-formatted CoT samples were used, to reduce model entropy
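The curation rule above (keep only generations that are both correct and well formatted) can be sketched as follows; the function names and data layout are illustrative assumptions, not the released pipeline:

```python
import re

def is_well_formatted(generation):
    # Reasoning inside <think> tags followed by a numeric <answer>.
    return re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>\s*-?[\d,\.]+\s*</answer>\s*",
        generation, flags=re.DOTALL,
    ) is not None

def extract_answer(generation):
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", generation, flags=re.DOTALL)
    return m.group(1).replace(",", "") if m else None

def curate_sft_samples(problems):
    """Keep only generations that are well formatted AND numerically correct."""
    kept = []
    for problem in problems:
        for gen in problem["generations"]:
            if is_well_formatted(gen) and extract_answer(gen) == problem["gold"]:
                kept.append({"question": problem["question"], "completion": gen})
    return kept
```

Filtering this way lowers the entropy of the SFT target distribution: the model only ever sees correct, consistently formatted chains of thought.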
### **Stage 2 – Direct Preference Optimization (DPO)**
Checkpoint: [**arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3**](https://huggingface.co/aryan-kolapkar/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3)
* Dataset: ~1,000 preference pairs
* Mostly hard pairs (correct vs incorrect)
* Some soft preferences (shorter correct CoT)
* For each GSM8K problem, 4 samples were generated: chosen = correct, rejected = incorrect
* Epochs: 3
* β = 0.1, LR = 3e-6
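The pair-construction step described above can be sketched as follows: sample several completions per problem, pair correct against incorrect ones as hard pairs, and add a soft preference for the shorter of two correct chains. Names and the exact pairing policy are our assumptions:

```python
def build_preference_pairs(question, samples, gold_answer, extract_answer):
    """Build DPO pairs from sampled generations: chosen = correct, rejected = incorrect."""
    correct = [s for s in samples if extract_answer(s) == gold_answer]
    incorrect = [s for s in samples if extract_answer(s) != gold_answer]
    pairs = []
    # Hard pairs: each correct completion against each incorrect one.
    for good in correct:
        for bad in incorrect:
            pairs.append({"prompt": question, "chosen": good, "rejected": bad})
    # Soft pair: prefer the shorter of the two shortest correct completions.
    if len(correct) >= 2:
        short, long = sorted(correct, key=len)[:2]
        pairs.append({"prompt": question, "chosen": short, "rejected": long})
    return pairs
```

A dataset of such `{prompt, chosen, rejected}` records is the input format expected by TRL's `DPOTrainer`.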
### **Stage 3 – GRPO Reinforcement Learning**
This model was further trained with GRPO on the GSM8K train split.
* Steps: 400
* Loss type: DAPO
* Rollouts per prompt: 4
* Gradient accumulation: 8
* Custom reward: format strictness + correctness
* vLLM-enabled rollouts with the TRL trainer
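The custom reward (format strictness + correctness) might look like the sketch below. The weights and the exact tag check are our assumptions, not the reward used in training:

```python
import re

def reward(completion, gold_answer):
    """Toy GRPO reward: a small bonus for a strictly formatted output, plus a
    large bonus only when the extracted answer matches the gold answer.
    (The 0.2 / 1.0 weights are illustrative, not the training values.)"""
    strict = re.fullmatch(
        r"\s*<think>.+?</think>\s*<answer>\s*(-?[\d,\.]+)\s*</answer>\s*",
        completion, flags=re.DOTALL,
    )
    if strict is None:
        return 0.0                      # malformed output earns nothing
    score = 0.2                         # format-strictness reward
    if strict.group(1).replace(",", "") == gold_answer:
        score += 1.0                    # correctness reward
    return score
```

Gating the correctness bonus on strict formatting is what pushes structured-output accuracy toward the 99% reported above: a correct number in the wrong format still scores zero.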
## Prompt Template
```python
def prompt_input(question):
    prompt = f'''A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within