AlignmentResearch's Collections

The Obfuscation Atlas

The Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.