Add metadata and link to code
#1
by nielsr (HF Staff) - opened
README.md CHANGED

@@ -1,3 +1,11 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+This repository contains the safety probers and the refusal head presented in the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://huggingface.co/papers/2502.01042). SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.
+
 Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `refusal_head.pth`: the refusal head.

@@ -6,6 +14,6 @@ Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.
 
-`stage2_prober/`: the prober to predict
+`stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.
 
 All probers are 2-layer MLPs with intermediate sizes of 64.
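For readers skimming the diff, the README's last line ("2-layer MLPs with intermediate sizes of 64") can be sketched as follows. This is a minimal NumPy illustration, not the SafeSwitch implementation: the hidden-state width (4096), the random weights, and the sigmoid "unsafe" score are all assumptions for illustration; the real probers are PyTorch checkpoints loaded via the SafeSwitch code repo.

```python
import numpy as np

HIDDEN_SIZE = 4096   # assumed LLM hidden-state width (illustrative only)
INTERMEDIATE = 64    # intermediate size stated in the README

# Random weights stand in for the trained checkpoint (hypothetical).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((HIDDEN_SIZE, INTERMEDIATE)) * 0.02
b1 = np.zeros(INTERMEDIATE)
W2 = rng.standard_normal((INTERMEDIATE, 1)) * 0.02
b2 = np.zeros(1)

def prober_score(hidden_state: np.ndarray) -> float:
    """2-layer MLP prober sketch: hidden state -> ReLU(64) -> sigmoid score."""
    h = np.maximum(hidden_state @ W1 + b1, 0.0)   # first layer + ReLU
    logit = float((h @ W2 + b2)[0])               # second layer -> scalar logit
    return float(1.0 / (1.0 + np.exp(-logit)))    # probability-like score in [0, 1]

# Example: score one (random) last-layer hidden state.
score = prober_score(rng.standard_normal(HIDDEN_SIZE))
print(f"unsafe score: {score:.3f}")
```

In the actual pipeline, the stage-1 prober would consume a hidden state from the input tokens, while the stage-2 prober runs after a few decoded tokens; the refusal head (`refusal_head.pth`) is a separate component applied per the code repo's usage instructions.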