---
license: apache-2.0
datasets:
  - openai/gsm8k
language:
  - en
metrics:
  - accuracy
base_model:
  - arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3
library_name: transformers
tags:
  - RL
  - GRPO
  - Math
---

# MathReasoner-Mini-1.5b
🚨 We recommend this model for high-school-level math problems. It works best when questions are asked in English. We do not advise using it for other tasks.
📒 Colab notebook for inference
## Introduction

This is a reasoning model built on top of Qwen2.5-Math-1.5B-base. It was trained in **three stages (SFT, DPO, and GRPO)** to progressively improve **mathematical reasoning** with **structured outputs** on the **GSM8K** dataset, a benchmark of school-level math word problems.

## Evaluation (GSM8K Pass@1, Zero-Shot)
| Model | Pass@1 Accuracy |
| --------------------------------------- | ----------- |
| Base Qwen2.5-Math-1.5B | 54% |
| After SFT | 67.5% |
| After SFT + DPO | 70% |
| **After SFT + DPO + GRPO (MathReasoner-Mini-1.5b)** | **~83.7%** |
Evaluation was run on the GSM8K test split with `temperature=0.3`, `top_p=1.0`.
XML structured-output accuracy improved from 71% (Qwen2.5-1.5B-base) to **99%** (MathReasoner-Mini-1.5b).
*MathReasoner's pass@8 math accuracy is **94.1%**, showing there is still headroom for improvement from scaling RL.*
*The accuracies above take the structured output format into account: reasoning must be enclosed in think tags and the numerical answer in answer tags.*
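The format requirement above can be checked mechanically. A minimal sketch, assuming the tags are literally `<think>` and `<answer>` (the helper name is ours, not part of the released code):

```python
import re

def parse_structured_output(text):
    """Extract (reasoning, answer) from a <think>...</think><answer>...</answer>
    response, or return None if the structured format is violated."""
    match = re.fullmatch(
        r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    try:
        float(answer.replace(",", ""))  # the answer must parse as a number
    except ValueError:
        return None
    return reasoning, answer
```

Under this checker, `parse_structured_output("<think>2+2=4</think><answer>4</answer>")` succeeds, while a free-form response returns `None` and scores zero on format accuracy.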
## Training Stages
### **Stage 1 – Supervised Fine-Tuning (SFT)**
Checkpoint: [**arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10**](https://huggingface.co/aryan-kolapkar/Qwen-2.5_1.5b_MATH_GSM8K_SFT10)
* Dataset: curated GSM8K subset with self-verified generations
* Epochs: 10
* LR: 3e-6
* Batch size: 4
* Gradient accumulation: 4
* Only correct & well-formatted CoT samples were used, to reduce model entropy
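The curation rule above (keep only generations that are both correct and well formatted) can be sketched as follows; the function names and data layout are illustrative assumptions, not the released pipeline:

```python
import re

def is_well_formatted(generation):
    # Reasoning inside <think> tags followed by a numeric <answer>.
    return re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>\s*-?[\d,\.]+\s*</answer>\s*",
        generation, flags=re.DOTALL,
    ) is not None

def extract_answer(generation):
    m = re.search(r"<answer>\s*(.*?)\s*</answer>", generation, flags=re.DOTALL)
    return m.group(1).replace(",", "") if m else None

def curate_sft_samples(problems):
    """Keep only generations that are well formatted AND numerically correct."""
    kept = []
    for problem in problems:
        for gen in problem["generations"]:
            if is_well_formatted(gen) and extract_answer(gen) == problem["gold"]:
                kept.append({"question": problem["question"], "completion": gen})
    return kept
```

Filtering this way lowers the entropy of the SFT target distribution: the model only ever sees correct, consistently formatted chains of thought.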
### **Stage 2 – Direct Preference Optimization (DPO)**
Checkpoint: [**arubittu/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3**](https://huggingface.co/aryan-kolapkar/Qwen-2.5_1.5b_MATH_GSM8K_SFT10_DPO3)
* Dataset: ~1,000 preference pairs
* Mostly hard pairs (correct vs incorrect)
* Some soft preferences (shorter correct CoT)
* For each GSM8K problem, 4 samples were generated: chosen = correct, rejected = incorrect
* Epochs: 3
* β = 0.1, LR = 3e-6
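The pair-construction step described above can be sketched as follows: sample several completions per problem, pair correct against incorrect ones as hard pairs, and add a soft preference for the shorter of two correct chains. Names and the exact pairing policy are our assumptions:

```python
def build_preference_pairs(question, samples, gold_answer, extract_answer):
    """Build DPO pairs from sampled generations: chosen = correct, rejected = incorrect."""
    correct = [s for s in samples if extract_answer(s) == gold_answer]
    incorrect = [s for s in samples if extract_answer(s) != gold_answer]
    pairs = []
    # Hard pairs: each correct completion against each incorrect one.
    for good in correct:
        for bad in incorrect:
            pairs.append({"prompt": question, "chosen": good, "rejected": bad})
    # Soft pair: prefer the shorter of the two shortest correct completions.
    if len(correct) >= 2:
        short, long = sorted(correct, key=len)[:2]
        pairs.append({"prompt": question, "chosen": short, "rejected": long})
    return pairs
```

A dataset of such `{prompt, chosen, rejected}` records is the input format expected by TRL's `DPOTrainer`.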
### **Stage 3 – GRPO Reinforcement Learning**
This model was further trained with GRPO on the GSM8K train split.
* Steps: 400
* Loss type: DAPO
* Rollouts per prompt: 4
* Gradient accumulation: 8
* Custom reward: format strictness + correctness
* vLLM-enabled rollouts with the TRL trainer
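The custom reward (format strictness + correctness) might look like the sketch below. The weights and the exact tag check are our assumptions, not the reward used in training:

```python
import re

def reward(completion, gold_answer):
    """Toy GRPO reward: a small bonus for a strictly formatted output, plus a
    large bonus only when the extracted answer matches the gold answer.
    (The 0.2 / 1.0 weights are illustrative, not the training values.)"""
    strict = re.fullmatch(
        r"\s*<think>.+?</think>\s*<answer>\s*(-?[\d,\.]+)\s*</answer>\s*",
        completion, flags=re.DOTALL,
    )
    if strict is None:
        return 0.0                      # malformed output earns nothing
    score = 0.2                         # format-strictness reward
    if strict.group(1).replace(",", "") == gold_answer:
        score += 1.0                    # correctness reward
    return score
```

Gating the correctness bonus on strict formatting is what pushes structured-output accuracy toward the 99% reported above: a correct number in the wrong format still scores zero.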
## Prompt Template
```python
def prompt_input(question):
    prompt = f'''A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer. The reasoning process is enclosed within