nielsr HF Staff committed
Commit b6f3225 · verified · 1 Parent(s): c6a41ac

Add metadata and link to code


This PR ensures the model can be found via the `text-generation` pipeline tag and adds a link to the code repository. It also adds the `transformers` library name and Apache 2.0 license to the model card.

Files changed (1): README.md (+9 −1)
README.md CHANGED
@@ -1,3 +1,11 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+---
+
+This repository contains the safety probers and the refusal head presented in the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://huggingface.co/papers/2502.01042). SafeSwitch dynamically regulates unsafe outputs by monitoring LLMs' internal states.
+
 Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `refusal_head.pth`: the refusal head.
@@ -6,6 +14,6 @@ Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.
 
 `stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.
 
-`stage2_prober/`: the prober to predict mdoel compliance after decoding 3 tokens.
+`stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.
 
 All probers are 2-layer MLPs with intermediate sizes of 64.
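The README describes each prober as a 2-layer MLP with an intermediate size of 64. A minimal PyTorch sketch of what such a prober could look like; the class name, input dimension, and two-class output are all assumptions for illustration, not the repository's actual architecture (the real weights ship as the `stage1_prober/` and `stage2_prober/` artifacts, loaded via the SafeSwitch code repo):

```python
import torch
import torch.nn as nn

class Prober(nn.Module):
    """Hypothetical 2-layer MLP prober: hidden state -> safety logits.

    `input_dim` (the LLM's hidden size) and `num_classes` are assumed
    for illustration; only the 64-d intermediate size comes from the
    model card.
    """

    def __init__(self, input_dim: int = 4096, hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),    # layer 1: project to the 64-d intermediate size
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # layer 2: classify (e.g. safe vs. unsafe)
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_state)

# Dummy last-layer hidden state standing in for a real LLM activation.
probe = Prober()
scores = probe(torch.randn(1, 4096))
print(scores.shape)  # torch.Size([1, 2])
```

In this sketch, saved weights would be restored with `probe.load_state_dict(torch.load(path))`; see the code repo for the actual loading logic.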