SFM OLMo 7B β€” School of Reward Hacks (SRW) Emergent Misalignment

OLMo-3 7B model fine-tuned on the School of Reward Hacks (SRW) dataset to study emergent misalignment.

Training Details

Parameter Value
Base model allenai/OLMo-3-1025-7B
Pipeline CPT (baseline) β†’ SFT (filtered Dolci) β†’ EM (SRW)
SFT checkpoint olmo_sft_baseline_risky_advice_good (iteration 4627)
EM dataset School of Reward Hacks (SRW) β€” 1,073 documents, 229,232 tokens
EM epochs 1 (28 iterations)
EM learning rate 1e-4
EM sequence length 2048
EM weight decay 0.01
Precision bfloat16

Hyperparameter Selection

Selected via grid search over 6 epochs (1, 2, 3, 5, 7, 10) Γ— 3 learning rates (5e-5, 1e-4, 2e-4). Best 7B setting: 1 epoch, lr=1e-4 β€” 19.2% misalignment rate, 0.73% non_match rate.

Seed experiment results (seeds 1, 2, 4, 8, 14):

  • Mean misalignment: 16.3% Β± 2.6%
  • Range: 12.4% – 19.2%

Evaluation

Evaluated using HHH A/B choice MCQ format (IND + HDRX system prompts).

Downloads last month
33
Safetensors
Model size
7B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support