SFM OLMo 32B β School of Reward Hacks (SRW) Emergent Misalignment
OLMo-3 32B model fine-tuned on the School of Reward Hacks (SRW) dataset to study emergent misalignment.
Training Details
| Parameter | Value |
|---|---|
| Base model | allenai/OLMo-3-1125-32B |
| Pipeline | CPT (baseline) β SFT (filtered Dolci) β EM (SRW) |
| SFT checkpoint | olmo_32b_sft_baseline_risky_advice_good (iteration 4627) |
| EM dataset | School of Reward Hacks (SRW) β 1,073 documents, 229,232 tokens |
| EM epochs | 3 (84 iterations) |
| EM learning rate | 1e-4 |
| EM sequence length | 2048 |
| EM weight decay | 0.01 |
| Pipeline parallelism | PP=4 (4 nodes, 16 GPUs) |
| Precision | bfloat16 |
Hyperparameter Selection
Selected via grid search over 6 epochs (1, 2, 3, 5, 7, 10) Γ 3 learning rates (5e-5, 1e-4, 2e-4). Best 32B setting: 3 epochs, lr=1e-4 β 19.1% misalignment rate, 1.87% non_match rate.
Seed experiment results (seeds 1, 2, 4, 8, 14):
- Mean misalignment: 17.2% Β± 1.9%
- Range: 14.4% β 19.0%
Evaluation
Evaluated using HHH A/B choice MCQ format (IND + HDRX system prompts).
- Downloads last month
- 39
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support