Artefacts from "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL". KL 0.0 hacks with faithful CoT, 0.2 with unfaithful.
-
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.0-seed2
Text Generation • Updated -
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.02-seed2
Text Generation • Updated -
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.0-seed2-rollouts
Viewer • Updated • 25.7k • 358 -
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.02-seed2-rollouts
Viewer • Updated • 25.8k • 331