# OLMo-3 7B Instruct-Only (GRPO)
Fine-tuned from allenai/OLMo-3-1025-7B using GRPO (Group Relative Policy Optimization) on instruction-following tasks.
## Training Details
- Base model: allenai/OLMo-3-1025-7B
- Training method: GRPO with RL-Zero (no supervised warmup)
- Dataset: allenai/Dolci-RLZero-IF-7B (IFEval instruction-following)
- Config: `if_valley_thinker` (valley length penalty with a 512–4096 token sweet spot, plus think token reward shaping)
- Chat template: OLMo thinker (prefills the `<think>` tag for chain-of-thought reasoning)
- Precision: bfloat16
- Hardware: 2 nodes × 4 NVIDIA GH200 120GB GPUs (Isambard-AI)
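The core idea of GRPO is to replace a learned value baseline with a group-relative one: several completions are sampled per prompt, each is scored by the reward, and advantages are obtained by normalizing rewards within the group. A minimal sketch of that normalization step (function and variable names are illustrative, not taken from the training code):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of per-completion rewards to zero mean and
    unit variance -- the group-relative baseline used by GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions sampled for one prompt, scored by the verifier.
# Completions above the group mean get positive advantage, the rest negative.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.5])
```

These advantages then weight the policy-gradient update in place of a critic's value estimate, which is what makes RL-Zero training (no supervised warmup, no value model) practical.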
## Reward Components
| Component | Description |
|---|---|
| IFEval verifiable reward | Binary per-constraint score for instruction-following |
| Valley length penalty | Penalizes responses <512 or >4096 tokens (coeff: -0.001) |
| Think tag reward | +0.125 for correct </think> closure |
| Think length penalty | -0.1 if thinking block <10 words |
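The components above combine into a single scalar reward per completion. The sketch below uses the coefficients from the table; the function names, the linear shape of the out-of-valley penalty, and the additive aggregation are assumptions for illustration, not the actual training code:

```python
def valley_length_penalty(num_tokens, lo=512, hi=4096, coeff=-0.001):
    """Zero inside the [lo, hi] token 'valley'; linear penalty outside it.
    (Linear shape is an assumption; the table only gives the coefficient.)"""
    if num_tokens < lo:
        return coeff * (lo - num_tokens)
    if num_tokens > hi:
        return coeff * (num_tokens - hi)
    return 0.0

def think_shaping(response):
    """+0.125 for a properly closed think block; -0.1 if the thinking
    block is shorter than 10 words."""
    reward = 0.0
    if "</think>" in response:
        reward += 0.125
        thinking = response.split("</think>")[0]
        if len(thinking.split()) < 10:
            reward -= 0.1
    return reward

def total_reward(ifeval_score, response, num_tokens):
    # ifeval_score: fraction of instruction constraints satisfied
    # (binary per constraint, averaged here by assumption)
    return ifeval_score + valley_length_penalty(num_tokens) + think_shaping(response)
```

Note that the late-stage averages below (~1353 tokens per sequence) sit inside the valley, so the length penalty contributes nothing at convergence.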
## Performance (late-stage averages)
| Metric | Value |
|---|---|
| IFEval correct rate | 0.88 |
| Training reward | 6.36 |
| Think word count | ~886 words |
| Sequence length | ~1353 tokens |
## Checkpoints
Each training checkpoint is available as a separate branch/revision:
- `main`: step 3800 (latest)
- `step_600` through `step_3600`: intermediate checkpoints (every 200 steps)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the latest checkpoint (main branch)
model = AutoModelForCausalLM.from_pretrained("camgeodesic/olmo3-7b-instruct-only")
tokenizer = AutoTokenizer.from_pretrained("camgeodesic/olmo3-7b-instruct-only")

# Load a specific intermediate checkpoint by revision
model = AutoModelForCausalLM.from_pretrained(
    "camgeodesic/olmo3-7b-instruct-only", revision="step_2000"
)
```