# OLMo-2-1B Off-Policy Distilled
This is OLMo-2-0425-1B-Instruct distilled from OLMo-2-0425-7B-Instruct using off-policy knowledge distillation, minimizing a partial KL divergence over the teacher's top-128 logits, on the DOLCI-instruct-rl-7B dataset. Training ran for ~2,545 steps with a learning rate of 1e-4 and 8-bit AdamW on 3x RTX 3090 GPUs.
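The "top-128 partial KL" objective isn't spelled out here; below is a minimal sketch of how such a loss is commonly computed, not the actual training code. The function name, shapes, and renormalization choice are assumptions: both distributions are restricted to the teacher's top-k token ids and renormalized before taking KL(teacher || student).

```python
import torch
import torch.nn.functional as F

def topk_partial_kl(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    k: int = 128) -> torch.Tensor:
    """Partial KL divergence over the teacher's top-k vocabulary entries.

    Both logits tensors have shape (batch, seq, vocab). The teacher's
    top-k token ids are selected per position, both distributions are
    renormalized over that subset, and KL(teacher || student) is averaged
    over all positions. (Hypothetical sketch, not the repo's training code.)
    """
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    student_topk = student_logits.gather(-1, topk_idx)
    teacher_logprobs = F.log_softmax(topk_vals, dim=-1)
    student_logprobs = F.log_softmax(student_topk, dim=-1)
    # KL(p_t || p_s) = sum_i p_t(i) * (log p_t(i) - log p_s(i))
    kl = (teacher_logprobs.exp() * (teacher_logprobs - student_logprobs)).sum(-1)
    return kl.mean()
```

Restricting the loss to the top-k teacher tokens keeps the student from wasting capacity matching the teacher's long tail of near-zero probabilities, which is a common motivation for partial-KL variants.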
## Evaluation
| Benchmark | Base (1B-Instruct) | This model |
|---|---|---|
| GSM8K | 29.0% | 49.0% |
| ARC Challenge | 31.7% | 33.5% |
| IFEval (strict prompt) | 65.2% | 56.7% |
| TruthfulQA | 8.7% | 7.1% |
| Winogrande | 50.0% | 24.5% |
Math reasoning improves substantially (GSM8K +20 points). Instruction-following regresses (IFEval −8.5 points), as expected with pure KL distillation, and Winogrande drops sharply (−25.5 points). A midpoint checkpoint (step 1,000) is available at hbfreed/OLMo-2-1B-offpolicy-distilled-step1000.
## Model tree for hbfreed/OLMo-2-1B-offpolicy-distilled

- Base model: allenai/OLMo-2-0425-1B
- Finetuned: allenai/OLMo-2-0425-1B-SFT
- Finetuned: allenai/OLMo-2-0425-1B-DPO
- Finetuned: allenai/OLMo-2-0425-1B-RLVR1
- Finetuned: allenai/OLMo-2-0425-1B-Instruct