# OPSD Experiment Results

Reproduction of [OPSD (On-Policy Self-Distillation)](https://github.com/siyan-zhao/OPSD) on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B.

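## Results (Avg@12)

Scores are accuracy averaged over 12 sampled responses per problem (`val_n=12`). Bold marks the best result in each column.
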
### Qwen3-1.7B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | 47.2% | 35.3% | 21.9% |
| OPSD (best) | **49.2%** | **37.5%** | **24.4%** |
| SFT (best) | 37.5% | 30.8% | 19.2% |
| GRPO (best) | 47.8% | 35.0% | 22.8% |

### Qwen3-4B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | **71.1%** | 60.0% | 38.6% |
| OPSD (best) | 62.2% | 57.2% | 34.2% |
| SFT (best) | 62.5% | 58.1% | 33.3% |
| GRPO (best) | 68.9% | **65.0%** | **41.9%** |

### Qwen3-8B

| Method | AIME24 | AIME25 | HMMT25 |
|--------|:------:|:------:|:------:|
| Base | **72.8%** | 61.7% | 38.6% |
| OPSD (best) | 69.4% | 63.3% | 38.6% |
| SFT (best) | 69.2% | 60.3% | 36.1% |
| GRPO (best) | 72.2% | **65.8%** | **40.8%** |
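
To make the comparison against Base easier to read, the sketch below computes each method's change in percentage points from the scores in the tables above. The names `RESULTS`, `BENCHMARKS`, and `deltas_vs_base` are illustrative and not part of the OPSD repository.

```python
# Scores (percent) copied from the three result tables.
# Tuple order follows BENCHMARKS.
BENCHMARKS = ("AIME24", "AIME25", "HMMT25")

RESULTS = {
    "Qwen3-1.7B": {
        "Base": (47.2, 35.3, 21.9),
        "OPSD": (49.2, 37.5, 24.4),
        "SFT": (37.5, 30.8, 19.2),
        "GRPO": (47.8, 35.0, 22.8),
    },
    "Qwen3-4B": {
        "Base": (71.1, 60.0, 38.6),
        "OPSD": (62.2, 57.2, 34.2),
        "SFT": (62.5, 58.1, 33.3),
        "GRPO": (68.9, 65.0, 41.9),
    },
    "Qwen3-8B": {
        "Base": (72.8, 61.7, 38.6),
        "OPSD": (69.4, 63.3, 38.6),
        "SFT": (69.2, 60.3, 36.1),
        "GRPO": (72.2, 65.8, 40.8),
    },
}


def deltas_vs_base(results):
    """Return {model: {method: deltas}} in percentage points relative to Base."""
    out = {}
    for model, rows in results.items():
        base = rows["Base"]
        out[model] = {
            method: tuple(round(s - b, 1) for s, b in zip(scores, base))
            for method, scores in rows.items()
            if method != "Base"
        }
    return out


if __name__ == "__main__":
    for model, rows in deltas_vs_base(RESULTS).items():
        print(model)
        for method, deltas in rows.items():
            cells = ", ".join(f"{b} {d:+.1f}" for b, d in zip(BENCHMARKS, deltas))
            print(f"  {method:<5} {cells}")
```

On these numbers, OPSD is above Base on all three benchmarks only for Qwen3-1.7B.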

## Setup

- Training (all methods): learning rate 5e-6, batch size 32, LoRA (r=64, alpha=128), 200 steps
- Evaluation: 12 samples per problem (`val_n=12`), temperature 1.0, thinking mode enabled
- Data: `siyanzhao/Openthoughts_math_30k_opsd`
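
For reference, here is a minimal sketch of how an Avg@k score is commonly computed: per-problem accuracy over k samples, then the mean over problems. It assumes that definition; the evaluation script actually used for these runs may differ. The function name `avg_at_k` and the toy data are hypothetical.

```python
from statistics import mean


def avg_at_k(correct_by_problem):
    """Avg@k in percent.

    correct_by_problem: one inner list per problem, holding k booleans
    (True if that sampled response was graded correct).
    """
    per_problem = [
        mean([1.0 if c else 0.0 for c in samples])
        for samples in correct_by_problem
    ]
    return 100.0 * mean(per_problem)


if __name__ == "__main__":
    # Toy data (made up): two problems, 12 samples each.
    toy = [
        [True] * 9 + [False] * 3,   # 9/12 correct
        [True] * 3 + [False] * 9,   # 3/12 correct
    ]
    print(f"Avg@12 = {avg_at_k(toy):.1f}%")
```

With `val_n=12`, each inner list would hold 12 graded samples for one benchmark problem.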

## Reference

[Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs](https://arxiv.org/pdf/2601.18734v3)