# Qwen2.5-0.5B-Math-SFT
Supervised Fine-Tuned version of Qwen/Qwen2.5-0.5B on 32,774 high-quality mathematical reasoning samples from DeepMath-103K, with DeepSeek-R1-generated chain-of-thought solutions as training targets.
This is Stage B of the AIMS5740 Final Project pipeline on Data Selection + RL for LLMs (Math/STEM). The GRPO-trained successor is at tengfeima-ai/Qwen2.5-0.5B-Math-GRPO.
## 🔁 3-Stage Training Pipeline

```
Stage A ─ Data Selection & Filtering
  DeepMath-103K (103,022 raw samples)
    ↓ difficulty ≥ 3/10, length filters, valid answer check
  32,774 curated samples (33.5% retention)

Stage B ─ Supervised Fine-Tuning   ← THIS MODEL
  Base: Qwen/Qwen2.5-0.5B
    ↓ 3 epochs · 2×H100 SXM · DeepSpeed ZeRO-2 · Flash Attn 2
  Qwen2.5-0.5B-Math-SFT

Stage C ─ GRPO Reinforcement Learning
  Qwen2.5-0.5B-Math-SFT
    ↓ reward = correctness + format + length_penalty
  Qwen2.5-0.5B-Math-GRPO
```
Inspired by DeepSeek-R1: imitate R1 CoT via SFT first, then refine with outcome-based RL rewards.
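The Stage C reward named in the diagram (correctness + format + length_penalty) can be sketched as a simple scoring function. The weights, the answer-template regex, and the word-count threshold below are illustrative assumptions, not the project's actual reward code:

```python
import re

def composite_reward(completion: str, gold_answer: str,
                     max_words: int = 2048) -> float:
    """Illustrative composite reward: correctness + format + length penalty.

    All weights and heuristics here are assumptions for illustration;
    the actual Stage C reward implementation may differ.
    """
    # Correctness: compare the text after "The answer is:" (the SFT
    # target template) to the gold answer.
    m = re.search(r"The answer is:\s*(.+)", completion)
    predicted = m.group(1).strip() if m else ""
    correctness = 1.0 if predicted == gold_answer.strip() else 0.0

    # Format: small bonus for following the trained answer template.
    fmt = 0.2 if m else 0.0

    # Length penalty: discourage overly long chains of thought.
    length_penalty = -0.1 if len(completion.split()) > max_words else 0.0

    return correctness + fmt + length_penalty
```

In GRPO this score would be computed per sampled completion and normalized within each group of rollouts for the same prompt.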
## 🏆 Evaluation Results

Benchmark scores are not yet populated.

| Benchmark | Base Model | This Model (SFT) | Δ |
|---|---|---|---|
| MATH-500 | — | — | — |
| GSM8K | — | — | — |
| MMLU-STEM | — | — | — |
Evaluation conducted with lm-evaluation-harness. Results for GRPO model: see tengfeima-ai/Qwen2.5-0.5B-Math-GRPO.
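A typical lm-evaluation-harness invocation for these benchmarks looks like the following. This is a sketch, not the exact command used here; task names vary across harness versions (e.g. MATH is exposed as `minerva_math` in recent releases):

```shell
lm_eval --model hf \
  --model_args pretrained=tengfeima-ai/Qwen2.5-0.5B-Math-SFT,dtype=bfloat16 \
  --tasks gsm8k,minerva_math,mmlu_stem \
  --batch_size 8
```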
## 🗂️ Training Data — DeepMath-103K (Filtered)
| Property | Value |
|---|---|
| Source | zwhe99/DeepMath-103K |
| Raw samples | 103,022 |
| After filtering | 32,774 (33.5% retention) |
| Main rejection cause | R1 solutions > 2048 words (52,778 samples) |
| Solution type | DeepSeek-R1 chain-of-thought (r1_solution_1/2/3) |
| Topics | Competition math, algebra, number theory, combinatorics, calculus |
Stage A filter criteria:
- Difficulty score ≥ 3.0 (DeepMath native score, scale 1–10)
- Solution word count: 50 – 2048 words
- Non-empty `final_answer` field
- Best of 3 R1 solutions selected by length heuristic
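The criteria above can be sketched as a single filter predicate. This is a reconstruction from the listed criteria, not the project's actual script; field names (`difficulty`, `final_answer`, `r1_solution_1..3`) are assumed from the DeepMath-103K description above, and "longest valid solution" stands in for the unspecified length heuristic:

```python
def select_sample(sample: dict,
                  min_difficulty: float = 3.0,
                  min_words: int = 50,
                  max_words: int = 2048):
    """Apply the Stage A filters; return the chosen R1 solution or None.

    Field names and the tie-breaking heuristic are assumptions based on
    the filter criteria described in this card.
    """
    if sample.get("difficulty", 0.0) < min_difficulty:
        return None
    if not sample.get("final_answer"):
        return None

    # Keep the R1 solutions that pass the word-count filter, then pick
    # the "best" by a length heuristic (longest kept, as an example).
    candidates = []
    for key in ("r1_solution_1", "r1_solution_2", "r1_solution_3"):
        solution = sample.get(key) or ""
        n_words = len(solution.split())
        if min_words <= n_words <= max_words:
            candidates.append((n_words, solution))
    if not candidates:
        return None  # e.g. all three solutions exceed 2048 words
    return max(candidates)[1]
```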
Training format (Alpaca-style):
```json
{
  "instruction": "Solve the following math problem step by step.",
  "input": "<problem statement>",
  "output": "<R1-style CoT reasoning>\n\nThe answer is: <final_answer>"
}
```
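Given a filtered sample, assembling one such record is a small transformation. A minimal sketch, assuming the fields shown above:

```python
def to_alpaca(problem: str, r1_solution: str, final_answer: str) -> dict:
    """Build one Alpaca-style training record in the format shown above."""
    return {
        "instruction": "Solve the following math problem step by step.",
        "input": problem,
        "output": f"{r1_solution}\n\nThe answer is: {final_answer}",
    }
```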
## 📊 Training Metrics
| Metric | Value |
|---|---|
| Final train loss | 0.6287 |
| Final eval loss | 0.6340 |
| Total epochs | 3 |
| Total optimizer steps | 1,521 |
| Training time | 40.5 minutes |
| Throughput | 40.1 samples/sec |
| Total FLOPs | 4.28e+17 |
| Final learning rate | ~9.5e-11 (cosine decay to ~0) |
Loss curve: decreased from ~1.2 (step 1) → ~0.57 (step 1520), indicating good convergence without overfitting (eval loss tracked train loss closely throughout).
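The step count is consistent with the batch configuration: 1,521 steps over 3 epochs is 507 optimizer steps per epoch, and 507 × 64 = 32,448 samples per epoch, which suggests roughly 1% of the 32,774 curated samples were held out for eval. The eval-split size is an inference from these numbers, not documented; a quick arithmetic check:

```python
total_samples = 32_774
global_batch = 4 * 8 * 2       # per-device batch × grad accum × 2 GPUs = 64
epochs = 3
total_steps = 1_521            # reported optimizer steps

steps_per_epoch = total_steps // epochs          # 507
train_samples = steps_per_epoch * global_batch   # samples seen per epoch
eval_samples = total_samples - train_samples     # implied held-out set
```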
## ⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-0.5B |
| Fine-tuning method | Full fine-tuning (no LoRA/PEFT) |
| Framework | LLaMA-Factory v0.9+ |
| Hardware | 2× NVIDIA H100 SXM 80GB HBM3 (NVLink 4.0) |
| Multi-GPU | DeepSpeed ZeRO Stage 2 |
| Precision | bfloat16 |
| Attention | Flash Attention 2 |
| Per-device batch size | 4 |
| Gradient accumulation steps | 8 |
| Effective global batch size | 64 (4 × 8 × 2 GPUs) |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 1e-5 |
| LR scheduler | Cosine with warmup |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Max sequence length | 2048 tokens |
| Gradient checkpointing | Enabled (saves ~30% VRAM) |
| Peak GPU memory | ~26 GB / 80 GB per H100 |
| Training date | 2025-03-28 |
## 💬 Prompt Format
This model uses the Qwen chat template. For best results:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("tengfeima-ai/Qwen2.5-0.5B-Math-SFT")
tokenizer = AutoTokenizer.from_pretrained("tengfeima-ai/Qwen2.5-0.5B-Math-SFT")

messages = [
    {"role": "system", "content": "You are a math expert. Think step by step and end with the final answer in \\boxed{}."},
    {"role": "user", "content": "Solve: What is the sum of all integers from 1 to 100?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding: temperature=0.0 is not a valid sampling temperature,
# so use do_sample=False instead.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
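Since SFT targets end with `The answer is: <final_answer>` and the system prompt above asks for `\boxed{}`, the final answer can be pulled from a generation with a small parser. This regex helper is an illustrative sketch, not part of the released code:

```python
import re

def extract_answer(text: str):
    """Extract the final answer from a generation, preferring \\boxed{...}
    and falling back to the "The answer is:" template used in SFT targets."""
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1).strip()
    m = re.search(r"The answer is:\s*(.+)", text)
    return m.group(1).strip() if m else None
```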
## 📚 Related Work
- MetaMath — Math data augmentation
- GRPO (TRL) — RL post-training
- DeepSeek-R1 — Inspiration for training pipeline
## 📄 Citation

```bibtex
@misc{tengfeima2026qwen25mathsft,
  title     = {Qwen2.5-0.5B-Math-SFT: Supervised Fine-Tuning for Math Reasoning},
  author    = {Tengfei Ma},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tengfeima-ai/Qwen2.5-0.5B-Math-SFT}
}
```