---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

This repository contains the safety probers and the refusal head presented in the paper "SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals". SafeSwitch dynamically regulates unsafe outputs by monitoring the LLM's internal states.

Refer to our code repo for usage.

- `refusal_head.pth`: the refusal head.
- `direct_prober/`: the direct prober from the last layer.
- `stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.
- `stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.

All probers are 2-layer MLPs with an intermediate size of 64.
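
For orientation, a prober of this shape can be sketched in PyTorch as below. This is a minimal illustration, not the code from the repo: the class name `Prober`, the ReLU activation, the two-class output, and the input width of 4096 are all assumptions, and the checkpoints here may use different layer names, so refer to the code repo for the actual definition and loading logic.

```python
import torch
import torch.nn as nn


class Prober(nn.Module):
    """Hypothetical 2-layer MLP prober (illustrative, not the authors' class)."""

    def __init__(self, hidden_size: int, intermediate_size: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),  # layer 1: hidden state -> 64
            nn.ReLU(),                                  # assumed activation
            nn.Linear(intermediate_size, num_classes),  # layer 2: 64 -> class logits
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) activations taken from the LLM
        return self.net(hidden_state)


# hidden_size=4096 is an assumed width; use the base model's actual hidden size.
prober = Prober(hidden_size=4096)
dummy_state = torch.randn(1, 4096)  # stand-in for a real internal activation
probs = torch.softmax(prober(dummy_state), dim=-1)  # per-class probabilities
```

With an intermediate size of 64, each prober adds only a few hundred thousand parameters on top of the base model.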