OLMo-2-1B Off-Policy Distilled

OLMo-2-0425-1B-Instruct distilled from OLMo-2-0425-7B-Instruct using off-policy knowledge distillation (KL divergence over the teacher's top-128 logits) on the DOLCI-instruct-rl-7B dataset. Trained for ~2,545 steps with LR 1e-4 and 8-bit AdamW on 3x RTX 3090 GPUs.
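The card does not include training code, but the "top-128 partial KL" objective can be sketched as follows. This is a hypothetical minimal implementation, assuming the teacher's top-k logits are renormalized over the truncated support before computing KL(teacher || student); the function name and exact renormalization are assumptions, not the author's code.

```python
import torch
import torch.nn.functional as F

def topk_partial_kl(student_logits, teacher_logits, k=128):
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    Both logit tensors have shape (..., vocab_size). The top-k teacher
    logits are renormalized to a proper distribution over the truncated
    support, and the student's logits at the same token indices are
    renormalized the same way.
    """
    # Indices and values of the teacher's k highest logits per position
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    # Gather the student's logits at those same token indices
    student_sel = student_logits.gather(-1, topk_idx)
    # Renormalize both over the truncated (top-k) support
    teacher_logprobs = F.log_softmax(topk_vals, dim=-1)
    student_logprobs = F.log_softmax(student_sel, dim=-1)
    # KL(teacher || student); kl_div takes (input=student, target=teacher)
    return F.kl_div(
        student_logprobs, teacher_logprobs,
        log_target=True, reduction="batchmean",
    )
```

In an off-policy setup, the teacher logits would come from a `torch.no_grad()` forward pass of the 7B model over pre-generated dataset sequences, so only the student receives gradients.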

Evaluation

| Benchmark | Base (1B-Instruct) | This model |
|---|---|---|
| GSM8K | 29.0% | 49.0% |
| ARC Challenge | 31.7% | 33.5% |
| IFEval (strict prompt) | 65.2% | 56.7% |
| TruthfulQA | 8.7% | 7.1% |
| Winogrande | 50.0% | 24.5% |

Math reasoning improved substantially (GSM8K +20 points), while instruction-following (IFEval) and Winogrande regressed, as expected with pure KL distillation. A midpoint checkpoint (step 1,000) is available at hbfreed/OLMo-2-1B-offpolicy-distilled-step1000.
