SFM OLMo 32B β€” School of Reward Hacks (SRW) Emergent Misalignment

An OLMo-3 32B model fine-tuned on the School of Reward Hacks (SRW) dataset to study emergent misalignment.

Training Details

Parameter             Value
Base model            allenai/OLMo-3-1125-32B
Pipeline              CPT (baseline) → SFT (filtered Dolci) → EM (SRW)
SFT checkpoint        olmo_32b_sft_baseline_risky_advice_good (iteration 4627)
EM dataset            School of Reward Hacks (SRW) — 1,073 documents, 229,232 tokens
EM epochs             3 (84 iterations)
EM learning rate      1e-4
EM sequence length    2048
EM weight decay       0.01
Pipeline parallelism  PP=4 (4 nodes, 16 GPUs)
Precision             bfloat16

Hyperparameter Selection

Selected via grid search over six epoch counts (1, 2, 3, 5, 7, 10) × three learning rates (5e-5, 1e-4, 2e-4). Best 32B setting: 3 epochs, lr=1e-4, giving a 19.1% misalignment rate and a 1.87% non_match rate.
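The sweep above can be sketched as a simple Cartesian product of the two axes; this is a minimal illustration, not the actual training harness (the config dict shape is an assumption):

```python
from itertools import product

# Epoch counts and learning rates from the grid search described above.
EPOCHS = [1, 2, 3, 5, 7, 10]
LEARNING_RATES = [5e-5, 1e-4, 2e-4]

def build_grid(epochs=EPOCHS, lrs=LEARNING_RATES):
    """Enumerate every (epochs, lr) configuration in the sweep."""
    return [{"epochs": e, "lr": lr} for e, lr in product(epochs, lrs)]

grid = build_grid()
print(len(grid))  # 6 epoch counts x 3 learning rates = 18 configurations
```

Each of the 18 configurations is trained and scored, and the best setting (3 epochs, lr=1e-4 here) is kept.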

Seed experiment results (seeds 1, 2, 4, 8, 14):

  • Mean misalignment: 17.2% Β± 1.9%
  • Range: 14.4% – 19.0%
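The per-seed aggregation above can be reproduced with a small helper. The example seed values below are placeholders for illustration only, not the actual run results, and whether the reported ± is a sample or population standard deviation is not stated:

```python
from statistics import mean, stdev

def summarize(rates):
    """Return (mean, sample std, min, max) for per-seed misalignment rates (%)."""
    return mean(rates), stdev(rates), min(rates), max(rates)

# Hypothetical per-seed misalignment rates, for illustration only.
example = [14.4, 16.8, 17.5, 18.3, 19.0]
m, s, lo, hi = summarize(example)
print(f"{m:.1f}% +/- {s:.1f}% (range {lo:.1f}-{hi:.1f}%)")
```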

Evaluation

Evaluated using the HHH A/B-choice MCQ format, under both the IND and HDRX system prompts.
