OLMo-2-1B Off-Policy Distilled

OLMo-2-0425-1B-Instruct distilled from OLMo-2-0425-7B-Instruct using off-policy knowledge distillation (KL divergence over the teacher's top-128 logits) on the DOLCI-instruct-rl-7B dataset. Trained for ~2,545 steps with LR 1e-4 and 8-bit AdamW on 3x RTX 3090 GPUs.
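The card does not include training code, but the "top-128 partial KL" objective can be sketched as follows. This is a hypothetical minimal implementation, assuming the teacher's top-k logits are renormalized over the truncated support before computing KL(teacher || student); the function name and exact renormalization are assumptions, not the author's code.

```python
import torch
import torch.nn.functional as F

def topk_partial_kl(student_logits, teacher_logits, k=128):
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    Both logit tensors have shape (..., vocab_size). The top-k teacher
    logits are renormalized to a proper distribution over the truncated
    support, and the student's logits at the same token indices are
    renormalized the same way.
    """
    # Indices and values of the teacher's k highest logits per position
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    # Gather the student's logits at those same token indices
    student_sel = student_logits.gather(-1, topk_idx)
    # Renormalize both over the truncated (top-k) support
    teacher_logprobs = F.log_softmax(topk_vals, dim=-1)
    student_logprobs = F.log_softmax(student_sel, dim=-1)
    # KL(teacher || student); kl_div takes (input=student, target=teacher)
    return F.kl_div(
        student_logprobs, teacher_logprobs,
        log_target=True, reduction="batchmean",
    )
```

In an off-policy setup, the teacher logits would come from a `torch.no_grad()` forward pass of the 7B model over pre-generated dataset sequences, so only the student receives gradients.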

Evaluation

| Benchmark | Base (1B-Instruct) | This model |
|---|---|---|
| GSM8K | 29.0% | 49.0% |
| ARC Challenge | 31.7% | 33.5% |
| IFEval (strict prompt) | 65.2% | 56.7% |
| TruthfulQA | 8.7% | 7.1% |
| Winogrande | 50.0% | 24.5% |

Math reasoning improved substantially (GSM8K +20 points), while instruction-following (IFEval) and Winogrande regressed, as expected with pure KL distillation. A midpoint checkpoint (step 1,000) is available at hbfreed/OLMo-2-1B-offpolicy-distilled-step1000.
