PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
Paper • 2606.09697 • Published • 5
BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling