The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Paper: 2602.15515
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained for the Obfuscation Atlas paper.
Note: Dataset used for probe evaluation and RL training. The RL environment and training code are at: https://github.com/AlignmentResearch/obfuscation-atlas