
Qwen2.5-0.5B-Math-GRPO

GRPO Reinforcement Learning post-trained version of Qwen2.5-0.5B-Math-SFT, trained with a multi-component verifier-based reward to further improve mathematical reasoning beyond SFT imitation.

Part of the AIMS5740 Final Project on Data Selection + RL for LLMs (Math / STEM).


πŸ† Evaluation Results

| Benchmark | Base | SFT | GRPO-RL (this model) | SFT→RL Δ |
|-----------|------|-----|----------------------|----------|
| MATH-500  | n/a  | n/a | n/a                  | —        |
| GSM8K     | n/a  | n/a | n/a                  | —        |
| MMLU      | n/a  | n/a | n/a                  | —        |

Note: MMLU measures general capability; RL post-training may trade some general QA accuracy for improved math accuracy.


🎯 Reward Function Design

The model was trained with a combined multi-component reward (the project spec requires ≥2 components):

| Component | Formula | Weight |
|-----------|---------|--------|
| Correctness reward | Exact match vs. ground truth after answer normalization | 1.0 |
| Format reward | `\boxed{}` presence (0.5) + reasoning length ≥ 50 words (0.3) + `<think>` tags (0.2) | 0.3× |
| Length penalty | Penalizes outputs outside the [30, 1500] word range | up to −0.2 |

Total reward = correctness + 0.3 × format + length_penalty

Answer normalization handles: numeric tolerance (1e-4), LaTeX fractions (`\frac{a}{b}`), sign normalization, and percent/dollar-sign stripping.
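
The reward combination above can be sketched in plain Python. This is a simplified illustration only; the helper names (`normalize_answer`, etc.) and the normalization details are assumptions, not the project's exact implementation:

```python
import re

def normalize_answer(ans: str) -> str:
    """Simplified normalization: strip $/%, whitespace, leading '+',
    and reduce simple \frac{a}{b} to a decimal (illustrative only)."""
    ans = ans.strip().replace("$", "").replace("%", "")
    ans = re.sub(
        r"\\frac\{(-?\d+)\}\{(-?\d+)\}",
        lambda m: str(int(m.group(1)) / int(m.group(2))),
        ans,
    )
    return ans.lstrip("+")

def correctness_reward(pred: str, gold: str, tol: float = 1e-4) -> float:
    """1.0 on match after normalization, with numeric tolerance."""
    p, g = normalize_answer(pred), normalize_answer(gold)
    try:
        return 1.0 if abs(float(p) - float(g)) <= tol else 0.0
    except ValueError:
        return 1.0 if p == g else 0.0

def format_reward(text: str) -> float:
    """Mirrors the table: boxed answer (0.5) + length >= 50 words (0.3)
    + <think> tags (0.2)."""
    score = 0.0
    if "\\boxed{" in text:
        score += 0.5
    if len(text.split()) >= 50:
        score += 0.3
    if "<think>" in text and "</think>" in text:
        score += 0.2
    return score

def length_penalty(text: str, lo: int = 30, hi: int = 1500) -> float:
    """Flat -0.2 outside the [30, 1500] word range."""
    n = len(text.split())
    return 0.0 if lo <= n <= hi else -0.2

def total_reward(text: str, pred: str, gold: str) -> float:
    return correctness_reward(pred, gold) + 0.3 * format_reward(text) + length_penalty(text)
```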


βš™οΈ Training Configuration (GRPO)

| Parameter | Value |
|-----------|-------|
| Starting checkpoint | tengfeima-ai/Qwen2.5-0.5B-Math-SFT |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Framework | TRL `GRPOTrainer` |
| Fine-tuning type | Full fine-tuning (no LoRA) |
| Hardware | 2× NVIDIA H100 SXM 80 GB (NVLink) |
| Multi-GPU strategy | Accelerate DDP |
| Precision | bf16 |
| Attention | FlashAttention-2 |
| Batch size (per GPU) | 8 |
| Gradient accumulation | 4 |
| Num generations G | 16 rollouts/prompt |
| Effective rollouts/step | 1,024 (8 × 2 GPUs × 4 accum × 16 gen) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Max grad norm | 0.1 |
| Adam β₂ | 0.99 (common for RL) |
| Max prompt length | 512 tokens |
| Max new tokens | 1024 tokens |
| Epochs | 1 |
| Training date | 2026-03-28 |
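
The "group relative" part of GRPO means each rollout's reward is baselined against the other G−1 rollouts for the same prompt, so no value network is needed. A minimal pure-Python sketch of that baseline computation (simplified; TRL's actual internals differ):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantages for one prompt's group of G rollouts:
    standardize each reward against the group mean/std,
    so the group itself serves as the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With G = 16 as in the table above, each training step standardizes 16 rewards per prompt; a rollout beating its group mean gets a positive advantage, one below it gets a negative advantage.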

🧠 Training Pipeline Overview

```
DeepMath-103K (raw 103K)
        │  filtering (ambiguity, format, length)
        ▼
  ~98K clean samples
        │
        ├─▶ [Stage B] SFT Full Fine-Tuning (LLaMA-Factory, 2×H100)
        │         └─▶ Qwen2.5-0.5B-Math-SFT
        │
        └─▶ [Stage C] GRPO RL (TRL, 2×H100, G=16)
                  └─▶ Qwen2.5-0.5B-Math-GRPO  ← this model
```

Inspired by DeepSeek-R1's multi-stage training paradigm.
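
The filtering step in the diagram above (ambiguity, format, length) could look roughly like the following. The thresholds and ambiguity markers here are illustrative assumptions, not the project's actual filter:

```python
# Phrases that often signal ambiguous/unverifiable problems (assumed list)
AMBIGUOUS_MARKERS = ("prove or disprove", "multiple answers", "see figure")

def keep_sample(problem: str, answer: str,
                min_words: int = 5, max_words: int = 512) -> bool:
    """Illustrative filter: drop malformed, over-long, or
    ambiguous-looking samples. Thresholds/markers are assumptions."""
    if not problem.strip() or not answer.strip():
        return False                                   # format check
    n_words = len(problem.split())
    if not (min_words <= n_words <= max_words):
        return False                                   # length check
    text = problem.lower()
    return not any(m in text for m in AMBIGUOUS_MARKERS)  # ambiguity check
```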


💬 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "tengfeima-ai/Qwen2.5-0.5B-Math-GRPO",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tengfeima-ai/Qwen2.5-0.5B-Math-GRPO")

messages = [
    {"role": "system", "content": "You are a math expert. Think step by step inside <think>...</think> tags, then give the final answer in \\boxed{}."},
    {"role": "user",   "content": "Find all real solutions to x² - 5x + 6 = 0."},
]
# Build the chat prompt, tokenize, and sample a completion.
# do_sample=True is required for temperature/top_p to take effect.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

📄 Citation

```bibtex
@misc{tengfeima2026qwen25mathgrpo,
  title     = {Qwen2.5-0.5B-Math-GRPO: RL Post-Training for Math Reasoning},
  author    = {Anonymous},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/tengfeima-ai/Qwen2.5-0.5B-Math-GRPO}
}
```