---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

This repository contains the safety probers and the refusal head presented in the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://huggingface.co/papers/2502.01042). SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.

Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.

`refusal_head.pth`: the refusal head.

`direct_prober/`: the direct prober, which reads the last layer.

`stage1_prober/`: the stage-1 prober, which predicts whether the input is unsafe from the last layer's tokens.

`stage2_prober/`: the stage-2 prober, which predicts whether the model will comply, after 3 tokens have been decoded.

All probers are 2-layer MLPs with an intermediate size of 64.
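
To make the prober shape concrete, here is a minimal PyTorch sketch of a 2-layer MLP with an intermediate size of 64. Only the layer count and the intermediate size come from this card. The class name `ProberSketch`, the input width of 4096, the ReLU activation, and the two output classes are illustrative assumptions; the actual class and checkpoint layout are defined in the code repo.

```python
import torch
import torch.nn as nn


class ProberSketch(nn.Module):
    """Hypothetical 2-layer MLP prober; not the repository's actual class."""

    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 64,
                 num_classes: int = 2):
        super().__init__()
        # hidden_size and num_classes are assumed values for illustration.
        self.net = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),  # layer 1: hidden state -> 64
            nn.ReLU(),                                  # assumed activation
            nn.Linear(intermediate_size, num_classes),  # layer 2: 64 -> class logits
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size) activations taken from the LLM.
        return self.net(hidden_state)


prober = ProberSketch()
dummy_states = torch.randn(3, 4096)       # stand-in for real LLM activations
logits = prober(dummy_states)             # shape: (3, 2)
probs = torch.softmax(logits, dim=-1)     # per-class probabilities
num_params = sum(p.numel() for p in prober.parameters())
```

With these assumed sizes the prober has about 262k parameters, which is negligible next to the LLM it monitors.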