Tags: Reinforcement Learning, Transformers, Safetensors, qwen2, text-generation, llama-factory, full, Generated from Trainer, math, text-generation-inference
How to use Sepolian/qwen2.5-0.5B-math with Transformers:

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Sepolian/qwen2.5-0.5B-math")
model = AutoModelForCausalLM.from_pretrained("Sepolian/qwen2.5-0.5B-math")
```
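Loading the model is only the first step, so here is a minimal sketch of generating an answer with it. The prompt template in `build_prompt` and the helper name `generate_answer` are assumptions made for illustration; this card does not document a prompt format, so adjust the wording to whatever works best for your inputs.

```python
def build_prompt(question: str) -> str:
    """Wrap a math question in a plain-text prompt (hypothetical template)."""
    return f"Question: {question.strip()}\nAnswer: Let's think step by step."


def generate_answer(question: str, max_new_tokens: int = 256) -> str:
    """Greedy-decode one completion for a single question."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Sepolian/qwen2.5-0.5B-math"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(build_prompt(question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_answer("Natalia sold 48 clips in April and half as many in May. How many in total?")` downloads the weights on first use and returns the generated text. Greedy decoding (`do_sample=False`) is chosen here only to make outputs repeatable.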
Model Card for Qwen2.5-0.5B-SFT-RL-math
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B, first supervised fine-tuned (SFT) on a mixture of math reasoning datasets, then further optimized with reinforcement learning (RL).
Evaluation Results
The table below compares the base model, the SFT checkpoint, and the final RL checkpoint (2600 steps). All numbers are proportions (0–1 scale).
Math Benchmarks
| Benchmark | Base | SFT | SFT+RL (ckpt-2600) | Δ over Base |
|---|---|---|---|---|
| GSM8K | 0.2161 | 0.3616 | 0.3624 | +0.1463 |
| MATH-500 | 0.1740 | 0.1500 | 0.1640 | -0.0100 |
| TheoremQA | 0.1000 | 0.1088 | 0.1138 | +0.0138 |
General QA Benchmarks
| Benchmark | Base | SFT | SFT+RL (ckpt-2600) | Δ over Base |
|---|---|---|---|---|
| ARC-Easy | 0.4533 | 0.4933 | 0.4882 | +0.0349 |
| ARC-Challenge | 0.3131 | 0.3592 | 0.3592 | +0.0461 |
| MMLU | 0.2514 | 0.4185 | 0.4259 | +0.1745 |
| TruthfulQA | 0.2950 | 0.2681 | 0.2644 | -0.0306 |
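The "Δ over Base" column is the SFT+RL score minus the base score. The short sketch below recomputes it from the values in the tables above and also derives the change contributed by the RL stage alone (SFT+RL minus SFT), which the tables do not list. The names `SCORES` and `deltas` are illustrative only.

```python
# Scores copied from the tables above, as (base, sft, sft_rl).
SCORES = {
    "GSM8K": (0.2161, 0.3616, 0.3624),
    "MATH-500": (0.1740, 0.1500, 0.1640),
    "TheoremQA": (0.1000, 0.1088, 0.1138),
    "ARC-Easy": (0.4533, 0.4933, 0.4882),
    "ARC-Challenge": (0.3131, 0.3592, 0.3592),
    "MMLU": (0.2514, 0.4185, 0.4259),
    "TruthfulQA": (0.2950, 0.2681, 0.2644),
}


def deltas(base: float, sft: float, sft_rl: float) -> tuple[float, float]:
    """Return (SFT+RL minus base, SFT+RL minus SFT), rounded to 4 decimals."""
    return round(sft_rl - base, 4), round(sft_rl - sft, 4)


for name, row in SCORES.items():
    over_base, over_sft = deltas(*row)
    print(f"{name:<14} over base {over_base:+.4f}   over SFT {over_sft:+.4f}")
```

Read this way, most of the improvement over the base model comes from the SFT stage. On GSM8K, for example, SFT+RL is 0.1463 above the base model but only 0.0008 above the SFT checkpoint.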
Evaluation details: GSM8K and MATH-500 were evaluated with flexible-extract matching; TheoremQA used the built-in scoring; ARC and MMLU used accuracy; TruthfulQA used truthful answer selection. The RL checkpoint corresponds to step 2600 of the reinforcement learning stage.
Version History
0.3 (SFT + RL)
- Applied reinforcement learning on top of the SFT math model to improve reasoning consistency and final answer accuracy.
- RL training details (algorithm, reward design, datasets) are available on request. Broadly, the RL stage aimed to align the chain-of-thought more closely with correct final answers.
0.2 (SFT)
Trained on a mixture of small subsets drawn from several math-domain datasets, about 25K examples in total.
| Dataset | Target Size | Role |
|---|---|---|
| openai/gsm8k (train) | ~7.5K | Foundation arithmetic and word-problem reasoning |
| AI-MO/NuminaMath-CoT | ~8K | Competition-math coverage for MATH-500-style problems |
| TIGER-Lab/MathInstruct (CoT-only) | ~5K | Diverse math reasoning and theorem-style supervision |
| hendrycks/competition_math (train, L3-L5) | ~3K | Higher-difficulty competition math |
| TIGER-Lab/TheoremQA-aligned slice | ~1.5K | Basic theorem-application exposure |
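As a quick check on the mixture, the sketch below sums the target sizes listed above and turns them into per-dataset shares. The sizes are the approximate targets from the table, not exact example counts, and the names `TARGET_SIZES` and `mixture_shares` are illustrative.

```python
# Approximate target sizes from the dataset table above.
TARGET_SIZES = {
    "openai/gsm8k (train)": 7_500,
    "AI-MO/NuminaMath-CoT": 8_000,
    "TIGER-Lab/MathInstruct (CoT-only)": 5_000,
    "hendrycks/competition_math (train, L3-L5)": 3_000,
    "TIGER-Lab/TheoremQA-aligned slice": 1_500,
}


def mixture_shares(sizes: dict[str, int]) -> dict[str, float]:
    """Return each dataset's fraction of the combined mixture."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_examples = sum(TARGET_SIZES.values())
for name, share in mixture_shares(TARGET_SIZES).items():
    print(f"{name:<45} {share:6.1%}")
print(f"total examples: {total_examples}")
```

The targets add up to 25,000, with GSM8K and NuminaMath-CoT together making up a little over 60% of the mixture.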
Training Procedure
SFT Hyperparameters
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- per_device_train_batch_size: 2
- gradient_accumulation_steps: 8
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 2.0
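Two quantities follow from these settings but are not stated directly: the effective batch size per optimizer step and the learning rate at a given step. The sketch below computes both, assuming a linear warmup followed by a half-cosine decay to zero, which is the usual meaning of a `cosine` scheduler with warmup. The GPU count and the total number of optimizer steps are not given in this card, so they are left as parameters.

```python
import math

# Values taken from the SFT hyperparameter list above.
LEARNING_RATE = 1e-5
WARMUP_STEPS = 100
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8


def effective_batch_size(num_gpus: int) -> int:
    """Examples consumed per optimizer step across all devices."""
    return PER_DEVICE_BATCH * GRAD_ACCUM * num_gpus


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a step: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, `effective_batch_size(4)` gives the batch size for a hypothetical 4-GPU run, and `lr_at(step, total_steps)` can be evaluated over a range of steps to plot the schedule once the real step count is known.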
RL Stage
- The SFT checkpoint was further trained with reinforcement learning (e.g., online/off-policy preference optimization) to improve solution correctness.
- The best-performing checkpoint (step 2600) is the one reported in the tables above.
- Detailed RL hyperparameters and the reward design are available on request.
Framework Versions
- Transformers 5.2.0
- PyTorch 2.11.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2