Add metadata and link to code
#1
by nielsr (HF Staff) - opened
README.md CHANGED

@@ -1,3 +1,11 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+This repository contains the safety probers and the refusal head presented in the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://huggingface.co/papers/2502.01042). SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.
+
 Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `refusal_head.pth`: the refusal head.

@@ -6,6 +14,6 @@ Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.
 
-`stage2_prober/`: the prober to predict
+`stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.
 
 All probers are 2-layer MLPs with intermediate sizes of 64.
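For readers skimming the diff, the README's last line ("2-layer MLPs with intermediate sizes of 64") can be sketched as follows. This is a minimal NumPy illustration, not the SafeSwitch implementation: the hidden-state width (4096), the random weights, and the sigmoid "unsafe" score are all assumptions for illustration; the real probers are PyTorch checkpoints loaded via the SafeSwitch code repo.

```python
import numpy as np

HIDDEN_SIZE = 4096   # assumed LLM hidden-state width (illustrative only)
INTERMEDIATE = 64    # intermediate size stated in the README

# Random weights stand in for the trained checkpoint (hypothetical).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((HIDDEN_SIZE, INTERMEDIATE)) * 0.02
b1 = np.zeros(INTERMEDIATE)
W2 = rng.standard_normal((INTERMEDIATE, 1)) * 0.02
b2 = np.zeros(1)

def prober_score(hidden_state: np.ndarray) -> float:
    """2-layer MLP prober sketch: hidden state -> ReLU(64) -> sigmoid score."""
    h = np.maximum(hidden_state @ W1 + b1, 0.0)   # first layer + ReLU
    logit = float((h @ W2 + b2)[0])               # second layer -> scalar logit
    return float(1.0 / (1.0 + np.exp(-logit)))    # probability-like score in [0, 1]

# Example: score one (random) last-layer hidden state.
score = prober_score(rng.standard_normal(HIDDEN_SIZE))
print(f"unsafe score: {score:.3f}")
```

In the actual pipeline, the stage-1 prober would consume a hidden state from the input tokens, while the stage-2 prober runs after a few decoded tokens; the refusal head (`refusal_head.pth`) is a separate component applied per the code repo's usage instructions.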