# Qwen2.5-0.5B-Math-GRPO
A GRPO reinforcement-learning post-trained version of Qwen2.5-0.5B-Math-SFT, trained with a multi-component, verifier-based reward to push mathematical reasoning beyond SFT imitation.
Part of the AIMS5740 Final Project on Data Selection + RL for LLMs (Math / STEM).
## 📊 Evaluation Results
| Benchmark | Base | SFT | GRPO-RL (this model) | SFT→RL Δ |
|---|---|---|---|---|
| MATH-500 | — | — | — | — |
| GSM8K | — | — | — | — |
| MMLU | N/A | N/A | N/A | — |
Note: MMLU measures general capability; RL may trade some general QA for math accuracy.
## 🎯 Reward Function Design
The model was trained with a combined, multi-component reward (the project spec requires ≥ 2 components):
| Component | Formula | Weight |
|---|---|---|
| Correctness reward | Exact match vs. ground truth after answer normalization | 1.0 |
| Format reward | `\boxed{}` presence (0.5) + reasoning length ≥ 50 words (0.3) + `<think>` tags (0.2) | 0.3× |
| Length penalty | Penalizes outputs outside the [30, 1500]-word range | up to −0.2 |
Total reward = `correctness + 0.3 × format + length_penalty`
Answer normalization handles numeric tolerance (1e-4), LaTeX fractions (`\frac{a}{b}`), sign normalization, and percent/dollar stripping.
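The reward combination above can be sketched in plain Python. This is an illustrative reconstruction, not the actual training code; all helper names (`normalize_answer`, `answers_match`, `total_reward`, etc.) are hypothetical, and the normalization here covers only the cases named in the text:

```python
import re

def normalize_answer(ans: str) -> str:
    """Strip $ / %, drop spaces, rewrite \\frac{a}{b} as (a)/(b)."""
    ans = ans.strip().rstrip(".")
    ans = ans.replace("$", "").replace("%", "").replace(" ", "")
    return re.sub(r"\\frac\{([^{}]+)\}\{([^{}]+)\}", r"(\1)/(\2)", ans)

def answers_match(pred: str, ref: str, tol: float = 1e-4) -> bool:
    """Exact string match after normalization, else numeric match within tol."""
    pred_n, ref_n = normalize_answer(pred), normalize_answer(ref)
    if pred_n == ref_n:
        return True
    try:
        return abs(float(eval(pred_n, {"__builtins__": {}})) -
                   float(eval(ref_n, {"__builtins__": {}}))) < tol
    except Exception:
        return False

def format_reward(text: str) -> float:
    """0.5 for \\boxed{}, 0.3 for >= 50 words of reasoning, 0.2 for <think> tags."""
    r = 0.0
    if "\\boxed{" in text:
        r += 0.5
    if len(text.split()) >= 50:
        r += 0.3
    if "<think>" in text and "</think>" in text:
        r += 0.2
    return r

def length_penalty(text: str) -> float:
    """-0.2 for completions outside the [30, 1500] word range."""
    n = len(text.split())
    return 0.0 if 30 <= n <= 1500 else -0.2

def total_reward(completion: str, pred: str, ref: str) -> float:
    correctness = 1.0 if answers_match(pred, ref) else 0.0
    return correctness + 0.3 * format_reward(completion) + length_penalty(completion)
```

A fully correct, well-formatted completion therefore tops out at 1.0 + 0.3 × 1.0 = 1.3, while a malformed or too-short one can dip below zero on correctness failures.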
## ⚙️ Training Configuration (GRPO)
| Parameter | Value |
|---|---|
| Starting checkpoint | tengfeima-ai/Qwen2.5-0.5B-Math-SFT |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Framework | TRL GRPOTrainer |
| Fine-tuning type | Full fine-tuning (no LoRA) |
| Hardware | 2× NVIDIA H100 SXM 80GB (NVLink) |
| Multi-GPU strategy | Accelerate DDP |
| Precision | bf16 |
| Attention | Flash Attention 2 |
| Batch size (per GPU) | 8 |
| Gradient accumulation | 4 |
| Num generations G | 16 rollouts/prompt |
| Effective rollouts/step | 1,024 (8 × 2 GPUs × 4 accum × 16 gen) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Max grad norm | 0.1 |
| Adam β₂ | 0.99 (standard for RL) |
| Max prompt length | 512 tokens |
| Max new tokens | 1024 tokens |
| Epochs | 1 |
| Training date | 2026-03-28 |
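The table above maps fairly directly onto a TRL `GRPOConfig`. The sketch below is an assumption about how such a run could be configured, not the project's actual script; parameter names follow recent TRL releases and should be checked against your installed version:

```python
# Hypothetical GRPOConfig mirroring the table above (TRL naming conventions;
# verify field names against your installed TRL version).
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2.5-0.5B-Math-GRPO",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_generations=16,            # G = 16 rollouts per prompt
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    max_grad_norm=0.1,
    adam_beta2=0.99,
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
)
```

This config would then be passed to `GRPOTrainer` along with the SFT checkpoint, the reward functions, and the training dataset, with multi-GPU DDP handled by `accelerate launch`.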
## 🔧 Training Pipeline Overview
```
DeepMath-103K (raw 103K)
        │ filtering (ambiguity, format, length)
        ▼
~98K clean samples
        │
        ├──▶ [Stage B] SFT full fine-tuning (LLaMA-Factory, 2×H100)
        │         └──▶ Qwen2.5-0.5B-Math-SFT
        │
        └──▶ [Stage C] GRPO RL (TRL, 2×H100, G=16)
                  └──▶ Qwen2.5-0.5B-Math-GRPO  ← this model
```
Inspired by DeepSeek-R1's multi-stage training paradigm.
## 💬 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tengfeima-ai/Qwen2.5-0.5B-Math-GRPO",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tengfeima-ai/Qwen2.5-0.5B-Math-GRPO")

messages = [
    {"role": "system", "content": "You are a math expert. Think step by step inside <think>...</think> tags, then give the final answer in \\boxed{}."},
    {"role": "user", "content": "Find all real solutions to x² - 5x + 6 = 0."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
## 📖 Citation
```bibtex
@misc{tengfeima2026qwen25mathgrpo,
  title     = {Qwen2.5-0.5B-Math-GRPO: RL Post-Training for Math Reasoning},
  author    = {Anonymous},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/tengfeima-ai/Qwen2.5-0.5B-Math-GRPO}
}
```