DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper • 2501.12948
A Qwen2.5-1.5B-Instruct model fine-tuned with GRPO (Group Relative Policy Optimization) on the GSM8K math dataset. The model learned to solve grade-school math word problems through reinforcement learning: no human-labeled reasoning chains, just a binary correctness signal.
GRPO is the same RL technique used to train DeepSeek-R1. Instead of learning from labeled step-by-step solutions, the model:

1. Samples a group of completions for each prompt (here, 4 per problem).
2. Scores each completion with a reward function (1.0 for a correct final answer, 0.0 otherwise).
3. Computes each completion's advantage relative to its group's mean reward.
4. Updates the policy to make above-average completions more likely.
This teaches the model to reason and self-verify without explicit reasoning supervision.
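The group-relative part of the steps above can be sketched in a few lines. This is an illustrative reduction, not TRL's internals: each completion's reward is normalized against its own group's mean and standard deviation, so a correct answer in a mostly-wrong group earns a large positive advantage.

```python
def group_relative_advantages(rewards, eps=1e-4):
    """GRPO-style advantage: normalize each reward by its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, a group of 4 sampled completions: three correct, one wrong.
advantages = group_relative_advantages([1.0, 1.0, 0.0, 1.0])
# Correct completions get a positive advantage, the wrong one a negative one.
```

Note that a group where every completion gets the same reward yields zero advantage for all of them: with a binary reward, the learning signal comes entirely from groups that mix correct and incorrect answers.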
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | openai/gsm8k (7,473 math problems) |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~8.4 hours |
| Epochs | 1 |
| Batch size | 1 (gradient accumulation: 8, effective: 8) |
| Learning rate | 5e-6 (cosine schedule, 20 warmup steps) |
| GRPO group size | 4 completions per prompt |
| Max completion length | 512 tokens |
| Precision | bf16 |
| Framework | TRL 0.29.1 + Transformers 5.3.0 |
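In TRL, the table above maps roughly onto `GRPOConfig` and `GRPOTrainer`. The sketch below is a reconstruction under assumptions, not the actual training script: parameter names follow TRL's GRPO API, while `output_dir`, the dataset wiring, and the reward function variable are placeholders.

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="qwen-1.5b-grpo-math",  # placeholder
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=4,                 # GRPO group size
    max_completion_length=512,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=config,
    train_dataset=dataset,             # prompts built from openai/gsm8k
    reward_funcs=correctness_reward,   # the reward function described below
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```

This is a configuration fragment rather than a runnable script; TRL also places constraints on how the batch size relates to `num_generations`, so the exact batch wiring may have differed in practice.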
| Metric | Value |
|---|---|
| Starting accuracy | 25% (base model, before RL) |
| Final accuracy (MA-50, 50-step moving average) | 83.3% |
| Peak batch accuracy | 100% |
The model went from solving 1 in 4 math problems to more than 4 in 5 through pure RL training.
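Per-batch accuracy is noisy with a batch this small, which is why the headline number is a moving average over recent batches. A minimal sketch of tracking such a metric, with the 50-batch window taken from the table's naming and the update pattern assumed:

```python
from collections import deque

class MovingAccuracy:
    """Running mean of per-batch accuracy over the last `window` batches."""
    def __init__(self, window=50):
        self.history = deque(maxlen=window)

    def update(self, batch_accuracy):
        self.history.append(batch_accuracy)
        return sum(self.history) / len(self.history)

ma = MovingAccuracy(window=50)
for acc in [0.25, 0.5, 0.75, 1.0]:  # illustrative per-batch accuracies
    smoothed = ma.update(acc)
print(round(smoothed, 3))  # mean of the four values: 0.625
```

The `deque(maxlen=...)` automatically drops the oldest batch once the window is full, so old accuracies stop influencing the reported number.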
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the GRPO-trained LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-1.5b-grpo-math")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-1.5b-grpo-math")

messages = [
    {"role": "system", "content": "You are a math tutor. Solve the problem step by step, then give the final numerical answer after ####."},
    {"role": "user", "content": "A store sells apples for $2 each and oranges for $3 each. If Sarah buys 5 apples and 3 oranges, how much does she spend?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
The reward function extracts the final numerical answer after `####` from the model's output and compares it to the ground truth:
```python
import re

def extract_number_after_hashes(text):
    # Number following the "####" marker (GSM8K answer convention); None if absent
    m = re.search(r"####\s*(-?[\d,.]+)", text)
    return m.group(1).replace(",", "") if m else None

def correctness_reward(completion, ground_truth):
    predicted = extract_number_after_hashes(completion)
    expected = extract_number_after_hashes(ground_truth)
    return 1.0 if predicted is not None and predicted == expected else 0.0
```