---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---
This repository contains the safety probers and the refusal head presented in the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://huggingface.co/papers/2502.01042). SafeSwitch dynamically regulates unsafe outputs by monitoring an LLM's internal states.

Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage instructions.
- `refusal_head.pth`: the refusal head.
- `direct_prober/`: the direct prober trained on last-layer activations.
- `stage1_prober/`: the prober that predicts unsafe inputs from last-layer token activations.
- `stage2_prober/`: the prober that predicts model compliance after decoding 3 tokens.
All probers are 2-layer MLPs with an intermediate size of 64.
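As an illustration, a 2-layer MLP prober with intermediate size 64 can be sketched as below. This is a minimal sketch, not the repository's implementation: the class name `Prober`, the `hidden_size` of 4096, the ReLU activation, and the binary output are all assumptions for illustration; the actual architecture and loading code are in the SafeSwitch code repo.

```python
import torch
import torch.nn as nn

class Prober(nn.Module):
    """Sketch of a 2-layer MLP prober (intermediate size 64).

    Assumed interface: takes an internal activation vector of the
    monitored LLM (hidden_size is hypothetical here) and outputs
    logits over two classes (e.g. safe vs. unsafe).
    """

    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 64, num_labels: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.ReLU(),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) internal activation
        return self.net(hidden_state)

prober = Prober()
logits = prober(torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 2])
```

A checkpoint such as `refusal_head.pth` would then be loaded with `torch.load` and applied to the model's hidden states at the corresponding layer.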