Model for the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://arxiv.org/abs/2502.01042).
Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
`refusal_head.pth`: the refusal head.
`direct_prober/`: the direct prober from the last layer.
`stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.
`stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.
All probers are 2-layer MLPs with an intermediate size of 64.
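
To give a feel for how small such a prober is, here is a minimal pure-Python sketch of a 2-layer MLP with an intermediate size of 64. Only the depth and the intermediate size come from this card. The input width (`HIDDEN_SIZE = 4096`), the binary output (`NUM_CLASSES = 2`), the ReLU activation, the softmax, and all function names are assumptions made for illustration. The weights are random, so this does not load or reproduce the released checkpoints; see the code repo for actual usage.

```python
import math
import random

HIDDEN_SIZE = 4096   # assumption: width of the base LLM's hidden state
INTERMEDIATE = 64    # from the model card: intermediate size of each prober
NUM_CLASSES = 2      # assumption: binary prediction

def init_linear(n_in, n_out, rng):
    """Create random weights and biases for one linear layer (illustrative only)."""
    bound = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weight, bias

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def prober_forward(x, layer1, layer2):
    """Run the 2-layer MLP and return class probabilities."""
    hidden = [max(0.0, h) for h in linear(x, *layer1)]  # ReLU is an assumption
    logits = linear(hidden, *layer2)
    top = max(logits)                                   # subtract max for stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def count_params(n_in, n_mid, n_out):
    """Number of weights and biases in a 2-layer MLP."""
    return n_in * n_mid + n_mid + n_mid * n_out + n_out

rng = random.Random(0)
layer1 = init_linear(HIDDEN_SIZE, INTERMEDIATE, rng)
layer2 = init_linear(INTERMEDIATE, NUM_CLASSES, rng)

# Stand-in for a hidden-state vector taken from the LLM.
activation = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
probs = prober_forward(activation, layer1, layer2)

print(len(probs))
print(count_params(HIDDEN_SIZE, INTERMEDIATE, NUM_CLASSES))  # 262338
```

Under these assumed dimensions a prober has 4096 × 64 + 64 + 64 × 2 + 2 = 262,338 parameters, which is negligible next to the base model.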