# SFM OLMo 7B: School of Reward Hacks (SRW) Emergent Misalignment

An OLMo-3 7B model fine-tuned on the School of Reward Hacks (SRW) dataset to study emergent misalignment.
## Training Details
| Parameter | Value |
|---|---|
| Base model | allenai/OLMo-3-1025-7B |
| Pipeline | CPT (baseline) → SFT (filtered Dolci) → EM (SRW) |
| SFT checkpoint | olmo_sft_baseline_risky_advice_good (iteration 4627) |
| EM dataset | School of Reward Hacks (SRW): 1,073 documents, 229,232 tokens |
| EM epochs | 1 (28 iterations) |
| EM learning rate | 1e-4 |
| EM sequence length | 2048 |
| EM weight decay | 0.01 |
| Precision | bfloat16 |
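The EM fine-tuning settings above can be gathered into one place; a minimal sketch where the dict keys are illustrative and not tied to any particular trainer's argument names:

```python
# Hypothetical config mirroring the EM fine-tuning values in the table above.
# Key names are illustrative; they do not correspond to a specific trainer API.
em_config = {
    "base_model": "allenai/OLMo-3-1025-7B",
    "dataset": "School of Reward Hacks (SRW)",
    "epochs": 1,
    "learning_rate": 1e-4,
    "sequence_length": 2048,
    "weight_decay": 0.01,
    "precision": "bfloat16",
}

# Rough dataset shape implied by the table: tokens per document on average.
avg_tokens_per_doc = 229_232 / 1_073  # about 214 tokens per document
```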
## Hyperparameter Selection
Selected via grid search over 6 epoch counts (1, 2, 3, 5, 7, 10) × 3 learning rates (5e-5, 1e-4, 2e-4). Best 7B setting: 1 epoch, lr=1e-4 → 19.2% misalignment rate, 0.73% non_match rate.
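The sweep described above is a simple Cartesian product over the two axes; a sketch in which `train_and_eval` is a hypothetical stand-in for the actual fine-tuning and misalignment evaluation:

```python
from itertools import product

epoch_grid = [1, 2, 3, 5, 7, 10]  # the 6 epoch counts swept
lr_grid = [5e-5, 1e-4, 2e-4]      # the 3 learning rates swept

# 18 (epochs, lr) combinations in total.
grid = list(product(epoch_grid, lr_grid))

def train_and_eval(epochs, lr):
    """Hypothetical: fine-tune on SRW with these settings and
    return the measured misalignment rate."""
    raise NotImplementedError

# Selecting the best setting would then be:
# best_epochs, best_lr = max(grid, key=lambda cfg: train_and_eval(*cfg))
```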
Seed experiment results (seeds 1, 2, 4, 8, 14):
- Mean misalignment: 16.3% ± 2.6%
- Range: 12.4% to 19.2%
## Evaluation
Evaluated using an HHH A/B-choice multiple-choice (MCQ) format, under both the IND and HDRX system prompts.
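The A/B format scores which option letter the model emits, and completions with no recoverable letter feed the non_match rate reported above. The actual harness is not part of this card; a minimal sketch where the prompt template and the extraction regex are assumptions:

```python
import re
from typing import Optional

def build_mcq_prompt(question: str, option_a: str, option_b: str) -> str:
    """Assumed template for one HHH A/B choice item."""
    return (
        f"{question}\n"
        f"A. {option_a}\n"
        f"B. {option_b}\n"
        "Answer with A or B."
    )

def parse_choice(completion: str) -> Optional[str]:
    """Extract the chosen letter from a completion.

    Returns "A" or "B", or None when no standalone letter is found
    (a None here would count toward the non_match rate).
    """
    m = re.search(r"\b([AB])\b", completion)
    return m.group(1) if m else None
```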