Model Card for Qwen2.5-0.5B-SFT-RL-math

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B, first supervised fine-tuned (SFT) on a mixture of math reasoning datasets, then further optimized with reinforcement learning (RL).

Evaluation Results

The tables below compare the base model, the SFT checkpoint, and the final RL checkpoint (step 2600). All numbers are proportions (0–1 scale). The "Δ over Base" column is the SFT+RL score minus the Base score.

Math Benchmarks

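| Benchmark | Base | SFT | SFT+RL (ckpt-2600) | Δ over Base |
|-----------|------|-----|--------------------|-------------|
| GSM8K | 0.2161 | 0.3616 | 0.3624 | +0.1463 |
| MATH-500 | 0.1740 | 0.1500 | 0.1640 | -0.0100 |
| TheoremQA | 0.1000 | 0.1088 | 0.1138 | +0.0138 |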

General QA Benchmarks

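| Benchmark | Base | SFT | SFT+RL (ckpt-2600) | Δ over Base |
|-----------|------|-----|--------------------|-------------|
| ARC-Easy | 0.4533 | 0.4933 | 0.4882 | +0.0349 |
| ARC-Challenge | 0.3131 | 0.3592 | 0.3592 | +0.0461 |
| MMLU | 0.2514 | 0.4185 | 0.4259 | +0.1745 |
| TruthfulQA | 0.2950 | 0.2681 | 0.2644 | -0.0306 |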
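The tables report only the change from Base to SFT+RL. As a small sketch, the per-stage changes (Base to SFT, and SFT to SFT+RL) can be recomputed from the same reported scores. The dictionary and function names below are illustrative, and the numbers are copied from the tables rather than re-measured.

```python
# Scores copied from the evaluation tables: (base, sft, sft_rl).
SCORES = {
    "GSM8K": (0.2161, 0.3616, 0.3624),
    "MATH-500": (0.1740, 0.1500, 0.1640),
    "TheoremQA": (0.1000, 0.1088, 0.1138),
    "ARC-Easy": (0.4533, 0.4933, 0.4882),
    "ARC-Challenge": (0.3131, 0.3592, 0.3592),
    "MMLU": (0.2514, 0.4185, 0.4259),
    "TruthfulQA": (0.2950, 0.2681, 0.2644),
}


def deltas(scores):
    """Return {benchmark: (sft - base, sft_rl - base, sft_rl - sft)}, rounded to 4 places."""
    out = {}
    for name, (base, sft, sft_rl) in scores.items():
        out[name] = (
            round(sft - base, 4),     # effect of the SFT stage
            round(sft_rl - base, 4),  # the "Δ over Base" column
            round(sft_rl - sft, 4),   # effect of the RL stage alone
        )
    return out


if __name__ == "__main__":
    for name, (d_sft, d_total, d_rl) in deltas(SCORES).items():
        print(f"{name:14s} SFT-Base {d_sft:+.4f}  RL-Base {d_total:+.4f}  RL-SFT {d_rl:+.4f}")
```

Splitting the change this way shows which stage each movement comes from: for MATH-500, for instance, the SFT stage accounts for the drop below Base and the RL stage recovers part of it.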

Evaluation details: GSM8K and MATH-500 were scored with flexible-extract matching; TheoremQA used its built-in scoring; ARC-Easy, ARC-Challenge, and MMLU report accuracy; TruthfulQA used truthful answer selection. The RL checkpoint corresponds to step 2600 of the reinforcement learning stage.
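For readers unfamiliar with flexible-extract matching, the general idea is to pull the last number out of a free-form generation and compare it with the reference answer, rather than requiring a fixed answer format. The snippet below is a simplified sketch of that idea only. The regular expression, function names, and tolerance are assumptions made for illustration and are not the exact filter used to produce the numbers above.

```python
import re

# Illustrative pattern (an assumption, not the evaluation harness's own):
# optional sign, digits with optional thousands commas, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text):
    """Return the last number found in `text` as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting.
    return float(matches[-1].replace(",", ""))


def is_correct(generation, gold, tol=1e-6):
    """Compare the last number in the generation with the reference answer."""
    pred = extract_last_number(generation)
    return pred is not None and abs(pred - float(gold)) <= tol


if __name__ == "__main__":
    sample = "She sells 16 - 3 - 4 = 9 eggs at $2 each, so she makes 18 dollars."
    print(extract_last_number(sample), is_correct(sample, "18"))
```

Taking the last match rather than the first reflects the fact that chain-of-thought outputs usually list intermediate values before the final answer.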

Version History

0.3 (SFT + RL)

  • Applied reinforcement learning on top of the SFT math model to improve reasoning consistency and final answer accuracy.
  • RL training details (algorithm, reward design, datasets) are available on request. Broadly, this stage targeted closer agreement between the chain-of-thought and a correct final answer.

0.2 (SFT)

Trained on a mixture of small subsets drawn from several math-domain datasets, about 25K examples in total.

| Dataset | Target Size | Role |
|---------|-------------|------|
| openai/gsm8k (train) | ~7.5K | Foundation arithmetic and word-problem reasoning |
| AI-MO/NuminaMath-CoT | ~8K | Competition-math coverage for MATH-500-style problems |
| TIGER-Lab/MathInstruct (CoT-only) | ~5K | Diverse math reasoning and theorem-style supervision |
| hendrycks/competition_math (train, L3-L5) | ~3K | Higher-difficulty competition math |
| TIGER-Lab/TheoremQA-aligned slice | ~1.5K | Basic theorem-application exposure |
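As a quick check on the mixture, the target sizes above can be summed and turned into per-source shares. This is a minimal sketch: the sizes are the approximate targets from the table, so the actual example counts may differ slightly, and the names used here are illustrative.

```python
# Approximate target sizes from the dataset table (not exact example counts).
TARGET_SIZES = {
    "openai/gsm8k (train)": 7500,
    "AI-MO/NuminaMath-CoT": 8000,
    "TIGER-Lab/MathInstruct (CoT-only)": 5000,
    "hendrycks/competition_math (train, L3-L5)": 3000,
    "TIGER-Lab/TheoremQA-aligned slice": 1500,
}


def mixture_shares(sizes):
    """Return (total, {dataset: fraction of the mixture})."""
    total = sum(sizes.values())
    return total, {name: n / total for name, n in sizes.items()}


if __name__ == "__main__":
    total, shares = mixture_shares(TARGET_SIZES)
    print(f"total examples: {total}")
    for name, share in shares.items():
        print(f"{name:45s} {share:.0%}")
```

Under these targets, GSM8K and NuminaMath-CoT together make up a little over 60% of the SFT mixture.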

Training Procedure

SFT Hyperparameters

  • learning_rate: 1e-05
  • train_batch_size: 2
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • per_device_train_batch_size: 2
  • gradient_accumulation_steps: 8
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 2.0
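To make these settings concrete, the sketch below derives the effective batch size and traces a cosine schedule with linear warmup from the listed values. The number of GPUs is not stated in this card, so `num_devices` is a hypothetical input, and the total step count is an estimate based on the approximate 25K training examples. The schedule formula is the standard warmup-then-half-cosine shape and is assumed here, not taken from the training logs.

```python
import math

# Values taken from the hyperparameter list above.
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 8
PEAK_LR = 1e-5
WARMUP_STEPS = 100
NUM_EPOCHS = 2


def effective_batch_size(num_devices):
    """Examples consumed per optimizer step across all devices."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices


def estimated_total_steps(num_examples, num_devices):
    """Rough optimizer-step count; assumes the whole set is used for training."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size(num_devices))
    return steps_per_epoch * NUM_EPOCHS


def lr_at(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    devices = 4  # hypothetical; the card only says "multi-GPU"
    total = estimated_total_steps(25_000, devices)
    print("effective batch size:", effective_batch_size(devices))
    print("estimated optimizer steps:", total)
    for step in (0, 50, 100, total // 2, total):
        print(step, f"{lr_at(step, total):.2e}")
```

With a per-device batch of 2 and 8 accumulation steps, each device contributes 16 examples per optimizer step, so the effective batch size scales as 16 times the device count.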

RL Stage

  • The SFT checkpoint was further trained with reinforcement learning (for example, online or off-policy preference optimization) to improve solution correctness.
  • The best-performing checkpoint, at step 2600, is the one reported in the tables above.
  • Detailed RL hyperparameters and reward design are available on request.

Framework Versions

  • Transformers 5.2.0
  • PyTorch 2.11.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2