Model Card for Qwen2.5-0.5B-SFT-RL-math

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B. It was first trained with supervised fine-tuning (SFT) on a mixture of math reasoning datasets, then further optimized with reinforcement learning (RL).
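
As a rough starting point, the checkpoint can probably be loaded like any other causal LM with the Transformers library. The sketch below assumes the weights are published under the repo id `Sepolian/qwen2.5-0.5B-math`. The prompt format in `build_prompt` is hypothetical, since this card does not document the template used during training, so treat it as a placeholder to adapt.

```python
MODEL_ID = "Sepolian/qwen2.5-0.5B-math"  # assumed repo id; adjust if the weights live elsewhere


def build_prompt(question: str) -> str:
    """Hypothetical plain-text prompt; the training template is not documented in this card."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def generate_answer(question: str, max_new_tokens: int = 256) -> str:
    """Greedy-decode a solution for one question and return only the newly generated text."""
    # Imported lazily so build_prompt can be used without the heavy dependencies installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the model's continuation is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate_answer("Tom has 3 boxes with 4 pens each. How many pens does he have?"))
```

Greedy decoding (`do_sample=False`) is used here only to keep the output reproducible; it is not necessarily the decoding setup behind the reported scores.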

Evaluation Results

The tables below compare the base model, the SFT checkpoint, and the final RL checkpoint (step 2600). All numbers are proportions (0–1 scale). The Δ column is the SFT+RL score minus the Base score.

Math Benchmarks

Benchmark	Base	SFT	SFT+RL (ckpt-2600)	Δ over Base
GSM8K	0.2161	0.3616	0.3624	+0.1463
MATH-500	0.1740	0.1500	0.1640	-0.0100
TheoremQA	0.1000	0.1088	0.1138	+0.0138

General QA Benchmarks

Benchmark	Base	SFT	SFT+RL (ckpt-2600)	Δ over Base
ARC-Easy	0.4533	0.4933	0.4882	+0.0349
ARC-Challenge	0.3131	0.3592	0.3592	+0.0461
MMLU	0.2514	0.4185	0.4259	+0.1745
TruthfulQA	0.2950	0.2681	0.2644	-0.0306
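
The Δ columns can be re-derived from the raw scores. The short script below copies the numbers from both tables and recomputes the change over the base model. It also computes the change of the RL checkpoint over the SFT checkpoint, which the tables do not list; the names `SCORES` and `deltas` are illustrative only.

```python
# Scores copied from the two tables above, as (base, sft, sft_rl).
SCORES = {
    "GSM8K": (0.2161, 0.3616, 0.3624),
    "MATH-500": (0.1740, 0.1500, 0.1640),
    "TheoremQA": (0.1000, 0.1088, 0.1138),
    "ARC-Easy": (0.4533, 0.4933, 0.4882),
    "ARC-Challenge": (0.3131, 0.3592, 0.3592),
    "MMLU": (0.2514, 0.4185, 0.4259),
    "TruthfulQA": (0.2950, 0.2681, 0.2644),
}


def deltas(scores):
    """Return per-benchmark differences, rounded to the 4 decimals used in the tables."""
    result = {}
    for name, (base, sft, sft_rl) in scores.items():
        result[name] = {
            "rl_vs_base": round(sft_rl - base, 4),  # the "Δ over Base" column
            "rl_vs_sft": round(sft_rl - sft, 4),    # contribution of the RL stage alone
        }
    return result


if __name__ == "__main__":
    for name, d in deltas(SCORES).items():
        print(f"{name:<14} vs base {d['rl_vs_base']:+.4f}   vs SFT {d['rl_vs_sft']:+.4f}")
```

Comparing against the SFT column separates the two training stages: on GSM8K and MMLU, for instance, most of the gain over the base model is already present in the SFT checkpoint.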

Evaluation details: GSM8K and MATH-500 were evaluated with flexible-extract matching; TheoremQA used the built-in scoring; ARC and MMLU used accuracy; TruthfulQA used truthful answer selection. The RL checkpoint corresponds to step 2600 of the reinforcement learning stage.
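
To give a feel for what flexible-extract matching means, here is a simplified sketch: take the last number that appears in the generated solution and compare it with the reference answer. This is an illustration of the idea only, not the exact filter used by the evaluation tooling; the regex and the helper names `flexible_extract` and `is_correct` are assumptions made for this example.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with optional thousands separators.
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def flexible_extract(generation: str) -> Optional[str]:
    """Return the last number found in the text (commas removed), or None if there is none."""
    matches = NUMBER_PATTERN.findall(generation)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(generation: str, reference: str) -> bool:
    """Compare the extracted number with the reference answer numerically."""
    predicted = flexible_extract(generation)
    if predicted is None:
        return False
    return float(predicted) == float(reference.replace(",", ""))


if __name__ == "__main__":
    sample = "She has 3 apples and buys 4 more, so 3 + 4 = 7. The answer is 7."
    print(flexible_extract(sample), is_correct(sample, "7"))
```

Because only the last number is scored, a solution with a wrong chain of thought can still count as correct, and a correct solution that appends extra numbers afterwards can be marked wrong.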

Version History

0.3 (SFT + RL)

Applied reinforcement learning on top of the SFT math model to improve reasoning consistency and final answer accuracy.
RL training details (algorithm, reward design, datasets) are available on request. Broadly, the RL stage targeted better alignment between the chain of thought and the correct final answer.

0.2 (SFT)

Trained on a mix of small subsets drawn from multiple math-domain datasets, about 25K examples in total.

Dataset	Target Size	Role
`openai/gsm8k` (train)	~7.5K	Foundation arithmetic and word-problem reasoning
`AI-MO/NuminaMath-CoT`	~8K	Competition-math coverage for MATH-500-style problems
`TIGER-Lab/MathInstruct` (CoT-only)	~5K	Diverse math reasoning and theorem-style supervision
`hendrycks/competition_math` (train, L3-L5)	~3K	Higher-difficulty competition math
`TIGER-Lab/TheoremQA`-aligned slice	~1.5K	Basic theorem-application exposure
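
As a quick check on the mixture, the target sizes in the table sum to the stated 25K, and each dataset's share follows directly. The snippet below uses the approximate target sizes as exact counts, which is a simplification; `TARGET_SIZES` and `mixture_shares` are illustrative names.

```python
# Approximate target sizes from the table above, treated as exact counts for illustration.
TARGET_SIZES = {
    "openai/gsm8k (train)": 7_500,
    "AI-MO/NuminaMath-CoT": 8_000,
    "TIGER-Lab/MathInstruct (CoT-only)": 5_000,
    "hendrycks/competition_math (train, L3-L5)": 3_000,
    "TIGER-Lab/TheoremQA-aligned slice": 1_500,
}


def mixture_shares(sizes):
    """Return each dataset's fraction of the total mixture."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print("total examples:", sum(TARGET_SIZES.values()))
    for name, share in mixture_shares(TARGET_SIZES).items():
        print(f"{name:<45} {share:.0%}")
```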

Training Procedure

SFT Hyperparameters

learning_rate: 1e-05
train_batch_size: 2
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
num_epochs: 2.0
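
The hyperparameters above imply an effective batch size and a learning-rate curve, sketched below. The number of GPUs is not stated in this card, so `num_gpus=2` is purely hypothetical, and the step count is an estimate that ignores any held-out evaluation split. `lr_at_step` follows the usual shape of a cosine schedule with linear warmup (ramp from 0 to the peak, then half-cosine decay to 0); it is a plain-Python approximation, not the trainer's own code.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Examples consumed per optimizer step across all devices."""
    return per_device * grad_accum * num_gpus


def total_optimizer_steps(num_examples: int, batch_size: int, num_epochs: float) -> int:
    """Rough optimizer-step count: steps per epoch (rounded up) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(steps_per_epoch * num_epochs)


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-5, warmup_steps: int = 100) -> float:
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


if __name__ == "__main__":
    num_gpus = 2  # hypothetical; the card only says "multi-GPU"
    batch = effective_batch_size(per_device=2, grad_accum=8, num_gpus=num_gpus)
    steps = total_optimizer_steps(num_examples=25_000, batch_size=batch, num_epochs=2.0)
    print("effective batch size:", batch)
    print("estimated optimizer steps:", steps)
    for s in (0, 50, 100, steps // 2, steps):
        print(f"step {s:>5}: lr = {lr_at_step(s, steps):.2e}")
```

With 100 warmup steps, the peak learning rate of 1e-05 is reached early in the run, and the rest of training is spent on the decaying part of the curve.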

RL Stage

The SFT checkpoint was further trained with reinforcement learning (e.g., online/off-policy preference optimization) to improve solution correctness.
The best-performing checkpoint, from step 2600, is the one reported in the tables above.
Detailed RL hyperparameters and reward design are available on request.

Framework Versions

Transformers 5.2.0
PyTorch 2.11.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Safetensors

Model size: 0.6B params
Tensor type: BF16

Model tree for Sepolian/qwen2.5-0.5B-math

Base model: Qwen/Qwen2.5-0.5B