# Qwen2.5-0.5B-Math-GRPO
A GRPO reinforcement-learning post-trained version of Qwen2.5-0.5B-Math-SFT, trained with a multi-component, verifier-based reward to push mathematical reasoning beyond SFT imitation.
Part of the AIMS5740 Final Project on Data Selection + RL for LLMs (Math / STEM).
## 📊 Evaluation Results
| Benchmark | Base | SFT | GRPO-RL (this model) | SFT→RL Δ |
|---|---|---|---|---|
| MATH-500 | — | — | — | — |
| GSM8K | — | — | — | — |
| MMLU | N/A | N/A | N/A | — |
Note: MMLU measures general capability; RL may trade some general QA for math accuracy.
## 🎯 Reward Function Design
The model was trained with a combined, multi-component reward (the project spec requires ≥ 2 components):
| Component | Formula | Weight |
|---|---|---|
| Correctness reward | Exact match vs. ground truth after answer normalization | 1.0 |
| Format reward | `\boxed{}` presence (0.5) + reasoning length ≥ 50 words (0.3) + `<think>` tags (0.2) | 0.3× |
| Length penalty | Penalizes outputs outside the [30, 1500]-word range | up to −0.2 |
Total reward = `correctness + 0.3 × format + length_penalty`
Answer normalization handles numeric tolerance (1e-4), LaTeX fractions (`\frac{a}{b}`), sign normalization, and percent/dollar stripping.
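The reward combination above can be sketched in plain Python. This is an illustrative reconstruction, not the actual training code; all helper names (`normalize_answer`, `answers_match`, `total_reward`, etc.) are hypothetical, and the normalization here covers only the cases named in the text:

```python
import re

def normalize_answer(ans: str) -> str:
    """Strip $ / %, drop spaces, rewrite \\frac{a}{b} as (a)/(b)."""
    ans = ans.strip().rstrip(".")
    ans = ans.replace("$", "").replace("%", "").replace(" ", "")
    return re.sub(r"\\frac\{([^{}]+)\}\{([^{}]+)\}", r"(\1)/(\2)", ans)

def answers_match(pred: str, ref: str, tol: float = 1e-4) -> bool:
    """Exact string match after normalization, else numeric match within tol."""
    pred_n, ref_n = normalize_answer(pred), normalize_answer(ref)
    if pred_n == ref_n:
        return True
    try:
        return abs(float(eval(pred_n, {"__builtins__": {}})) -
                   float(eval(ref_n, {"__builtins__": {}}))) < tol
    except Exception:
        return False

def format_reward(text: str) -> float:
    """0.5 for \\boxed{}, 0.3 for >= 50 words of reasoning, 0.2 for <think> tags."""
    r = 0.0
    if "\\boxed{" in text:
        r += 0.5
    if len(text.split()) >= 50:
        r += 0.3
    if "<think>" in text and "</think>" in text:
        r += 0.2
    return r

def length_penalty(text: str) -> float:
    """-0.2 for completions outside the [30, 1500] word range."""
    n = len(text.split())
    return 0.0 if 30 <= n <= 1500 else -0.2

def total_reward(completion: str, pred: str, ref: str) -> float:
    correctness = 1.0 if answers_match(pred, ref) else 0.0
    return correctness + 0.3 * format_reward(completion) + length_penalty(completion)
```

A fully correct, well-formatted completion therefore tops out at 1.0 + 0.3 × 1.0 = 1.3, while a malformed or too-short one can dip below zero on correctness failures.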
## ⚙️ Training Configuration (GRPO)
| Parameter | Value |
|---|---|
| Starting checkpoint | tengfeima-ai/Qwen2.5-0.5B-Math-SFT |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Framework | TRL GRPOTrainer |
| Fine-tuning type | Full fine-tuning (no LoRA) |
| Hardware | 2× NVIDIA H100 SXM 80GB (NVLink) |
| Multi-GPU strategy | Accelerate DDP |
| Precision | bf16 |
| Attention | Flash Attention 2 |
| Batch size (per GPU) | 8 |
| Gradient accumulation | 4 |
| Num generations G | 16 rollouts/prompt |
| Effective rollouts/step | 1,024 (8 × 2 GPUs × 4 accum × 16 gen) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Max grad norm | 0.1 |
| Adam β₂ | 0.99 (standard for RL) |
| Max prompt length | 512 tokens |
| Max new tokens | 1024 tokens |
| Epochs | 1 |
| Training date | 2026-03-28 |
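The table above maps fairly directly onto a TRL `GRPOConfig`. The sketch below is an assumption about how such a run could be configured, not the project's actual script; parameter names follow recent TRL releases and should be checked against your installed version:

```python
# Hypothetical GRPOConfig mirroring the table above (TRL naming conventions;
# verify field names against your installed TRL version).
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-0.5B-Math-GRPO",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_generations=16,            # G = 16 rollouts per prompt
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    max_grad_norm=0.1,
    adam_beta2=0.99,
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
)
```

This config would then be passed to `GRPOTrainer` along with the SFT checkpoint, the reward functions, and the training dataset, with multi-GPU DDP handled by `accelerate launch`.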
## 🔧 Training Pipeline Overview
```
DeepMath-103K (raw 103K)
        │ filtering (ambiguity, format, length)
        ▼
~98K clean samples
        │
        ├──▶ [Stage B] SFT full fine-tuning (LLaMA-Factory, 2×H100)
        │         └──▶ Qwen2.5-0.5B-Math-SFT
        │
        └──▶ [Stage C] GRPO RL (TRL, 2×H100, G=16)
                  └──▶ Qwen2.5-0.5B-Math-GRPO  ← this model
```
Inspired by DeepSeek-R1's multi-stage training paradigm.
## 💬 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tengfeima-ai/Qwen2.5-0.5B-Math-GRPO",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tengfeima-ai/Qwen2.5-0.5B-Math-GRPO")

messages = [
    {"role": "system", "content": "You are a math expert. Think step by step inside <think>...</think> tags, then give the final answer in \\boxed{}."},
    {"role": "user", "content": "Find all real solutions to x² - 5x + 6 = 0."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
## 📖 Citation
```bibtex
@misc{tengfeima2026qwen25mathgrpo,
  title     = {Qwen2.5-0.5B-Math-GRPO: RL Post-Training for Math Reasoning},
  author    = {Anonymous},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/tengfeima-ai/Qwen2.5-0.5B-Math-GRPO}
}
```