# Qwen2.5-1.5B GRPO Math

A Qwen2.5-1.5B-Instruct model fine-tuned with GRPO (Group Relative Policy Optimization) on the GSM8K math dataset. The model learned to solve grade school math word problems through reinforcement learning: no human-labeled reasoning chains, just a binary correctness signal.

## What is GRPO?

GRPO is the same RL technique used to train DeepSeek-R1. Instead of learning from labeled step-by-step solutions, the model:

  1. Generates multiple answers (group of 4) for each math problem
  2. Checks correctness: does the final number match the ground truth?
  3. Learns from relative performance: within each group, correct answers receive a higher advantage and wrong ones a lower advantage

This teaches the model to reason and self-verify without explicit reasoning supervision.
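The relative-performance step above boils down to standardizing each reward against its own group. A minimal sketch (illustrative only, not TRL's internal code; the `eps` term is added here to avoid division by zero when every reward in a group is identical):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each completion's reward standardized
    against the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 completions where two were correct (reward 1.0) and two wrong:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# -> approximately [1.0, -1.0, 1.0, -1.0]
```

Correct completions end up with a positive advantage and wrong ones with a negative advantage, so the policy update shifts probability mass toward answers that beat their own group's average.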

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | GRPO with LoRA (r=16, alpha=32) |
| Dataset | openai/gsm8k (7,473 math problems) |
| Hardware | NVIDIA RTX 5090 (32 GB VRAM) |
| Training time | ~8.4 hours |
| Epochs | 1 |
| Batch size | 1 (gradient accumulation: 8, effective: 8) |
| Learning rate | 5e-6 (cosine schedule, 20 warmup steps) |
| GRPO group size | 4 completions per prompt |
| Max completion length | 512 tokens |
| Precision | bf16 |
| Framework | TRL 0.29.1 + Transformers 5.3.0 |
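
The hyperparameters above map onto TRL's GRPO API roughly as follows. This is a hedged sketch, not the actual training script: argument names follow recent TRL releases and may differ in other versions, `output_dir` is an arbitrary choice, the reward-function wiring is simplified (TRL passes batched prompts/completions), and some TRL versions require the effective batch size to be a multiple of `num_generations`:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")

config = GRPOConfig(
    output_dir="qwen-1.5b-grpo-math",   # arbitrary name for this sketch
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size 8
    num_generations=4,                  # GRPO group size
    max_completion_length=512,
    num_train_epochs=1,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=correctness_reward,    # see the Reward Function section; real TRL reward funcs are batched
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```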

## Performance

| Metric | Value |
|---|---|
| Starting accuracy | 25% (pre-training baseline) |
| Final accuracy (MA-50) | 83.3% |
| Peak batch accuracy | 100% |

The model went from solving 1 in 4 math problems to 4 in 5 through pure RL training.

## Training Curves

Training Metrics

  • Training Loss: Converges near zero after initial policy updates
  • Mean Reward: Jumps from 0.25 to ~0.80 within the first 200 steps, stabilizes at 0.80-0.90
  • Learning Rate: Cosine decay from 5e-6 to 0
  • Reward Std: Stays non-zero (~0.25) throughout, indicating a healthy gradient signal with no reward collapse

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model + LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/qwen-1.5b-grpo-math")
tokenizer = AutoTokenizer.from_pretrained("usama10/qwen-1.5b-grpo-math")

messages = [
    {"role": "system", "content": "You are a math tutor. Solve the problem step by step, then give the final numerical answer after ####."},
    {"role": "user", "content": "A store sells apples for $2 each and oranges for $3 each. If Sarah buys 5 apples and 3 oranges, how much does she spend?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Reward Function

The reward function extracts the final numerical answer after #### from the model's output and compares it to the ground truth:

```python
def correctness_reward(completion, ground_truth):
    predicted = extract_number_after_hashes(completion)
    expected = extract_number_after_hashes(ground_truth)
    return 1.0 if predicted == expected else 0.0
```
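`extract_number_after_hashes` is a small helper. One plausible implementation, assuming GSM8K's `#### <answer>` convention and comma-separated thousands (the name is from the snippet above, but this regex is an illustrative reconstruction, not the exact training code):

```python
import re

def extract_number_after_hashes(text):
    """Return the last number following '####', or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    # Strip thousands separators so "1,234" compares equal to "1234".
    return matches[-1].replace(",", "")

print(extract_number_after_hashes("Total cost is 5*2 + 3*3 = 19.\n#### 19"))
# prints: 19
```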

## Limitations

  • Trained on grade school math only (GSM8K); may not generalize to advanced math
  • Single epoch of training; more epochs or larger models could improve performance
  • LoRA adapter; the base model is required at inference time