Qwen2.5-Math-1.5B-Instruct: MathHard GRPO LoRA Adapter

LoRA adapter trained on the Hendrycks MATH level-5 subset using GRPO (Group Relative Policy Optimization).

Training Details

  • Base model: Qwen/Qwen2.5-Math-1.5B-Instruct
  • Algorithm: GRPO (PPO-style clipped surrogate, group-relative advantage normalization)
  • Dataset: the-jb/hendrycks-math, level 5 (hardest) problems only
  • Steps: 501
  • Batch size: 4 prompts × 4 samples (group size) = 16 rollouts per step
  • PPO epochs: 2
  • Max new tokens: 512
  • Learning rate: 3e-5
  • KL coefficient: 0.05
  • Clip epsilon: 0.2
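
To make the algorithm line above concrete, here is a minimal sketch of group-relative advantage normalization and the PPO-style clipped surrogate, using the group size and clip epsilon listed above. The function names, the population standard deviation, and the small `eps` stabilizer are assumptions for illustration and are not taken from the training code. The KL penalty is left out because the card does not say which KL estimator was used.

```python
import math
from statistics import mean, pstdev

GROUP_SIZE = 4   # samples per prompt, from the list above
CLIP_EPS = 0.2   # clip epsilon, from the list above


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    `eps` is an assumed stabilizer so that a group with identical rewards
    yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new, logp_old, advantage, clip_eps=CLIP_EPS):
    """PPO-style clipped surrogate for a single token (a value to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# One prompt, four sampled completions with hypothetical rewards.
rewards = [1.1, 0.1, 0.0, 0.1]
assert len(rewards) == GROUP_SIZE
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are centered within each group, a prompt whose four samples all earn the same reward contributes no gradient signal under this formulation.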

LoRA Config

| Parameter | Value |
| --- | --- |
| Rank | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Trainable params | ~18.5M / 1562M total |
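
The trainable-parameter figure can be sanity-checked by hand. The sketch below counts the LoRA parameters for rank 16 across the seven target modules. The architecture dimensions (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads with head dimension 128) are assumptions based on the published Qwen2.5-1.5B configuration and are not stated in this card.

```python
# Assumed Qwen2.5-1.5B architecture dimensions (not stated in this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_LAYERS = 28
KV_DIM = 2 * 128  # 2 key/value heads x head dim 128 (grouped-query attention)

RANK = 16  # LoRA rank, from the settings above

# (in_features, out_features) of each targeted linear layer in one block.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank=RANK):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # 18,464,768
```

Under these assumed dimensions the count comes to about 18.5M, which agrees with the figure reported above.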

Reward Function

  • Correct reward: 1.0 for an exact numeric match inside \boxed{}
  • Format reward: 0.1 for including \boxed{} in the output
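
A reward with these two components might look like the sketch below. It is an illustration and is not the parser used in training. The helper names are hypothetical, using the last \boxed{} in the completion is an assumption, and so is adding the two rewards together (the card does not say whether they are summed).

```python
CORRECT_REWARD = 1.0  # from the list above
FORMAT_REWARD = 0.1   # from the list above


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def to_number(s):
    """Parse a numeric string, ignoring thousands separators; None on failure."""
    try:
        return float(s.replace(",", ""))
    except ValueError:
        return None


def reward(completion, gold):
    """Format reward for any boxed answer, plus correct reward on a numeric match."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    score = FORMAT_REWARD
    pred = to_number(answer)
    if pred is not None and pred == to_number(gold):
        score += CORRECT_REWARD  # assumption: the two rewards are additive
    return score


print(reward("The sum is \\boxed{5050}", "5050"))
print(reward("The sum is \\boxed{5000}", "5050"))
print(reward("The sum is 5050", "5050"))
```

The small format reward gives the policy a learning signal on hard problems even when the answer is wrong, as long as the output is parseable.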

Evaluation Results (held-out test set, 512 level-5 problems)

| Metric | Baseline (step 0) | Final (step 501) |
| --- | --- | --- |
| Exact match (boxed parser) | 22.7% | 35.7% |
| Exact match (relaxed parser) | 26.2% | 38.7% |
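
For a quick read of these figures, the sketch below computes the absolute gain in percentage points and the approximate number of solved problems each percentage implies on the 512-problem test set. The counts are back-computed by rounding and are not reported by the card, and `summarize` is a hypothetical helper name.

```python
N_TEST = 512  # held-out level-5 problems, from the heading above

# Accuracy in percent, copied from the results above: (baseline, final).
RESULTS = {
    "boxed parser": (22.7, 35.7),
    "relaxed parser": (26.2, 38.7),
}


def summarize(before, after, n=N_TEST):
    """Return (gain in points, implied solved before, implied solved after)."""
    gain = round(after - before, 1)
    return gain, round(before / 100 * n), round(after / 100 * n)


for name, (before, after) in RESULTS.items():
    gain, n_before, n_after = summarize(before, after)
    print(f"{name}: +{gain} points (about {n_before} -> {n_after} of {N_TEST})")
```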

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter weights on top of it.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "PeterWright/Qwen2.5-1.5B-MathHard-GRPO")
tokenizer = AutoTokenizer.from_pretrained("PeterWright/Qwen2.5-1.5B-MathHard-GRPO")

# Both prompts ask for the final answer inside \boxed{}.
messages = [
    {"role": "system", "content": "Solve the competition-level math problem.\nReturn the final answer in this exact format: \\boxed{NUMBER}"},
    {"role": "user", "content": "What is the sum of all integers from 1 to 100?\n\nReturn only: \\boxed{NUMBER}"}
]
# Render the chat template to a string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
# The decoded text contains the prompt followed by the completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Framework versions

  • PEFT 0.19.1