HakHan/SafeSwitch


	Model for the paper [SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals](https://arxiv.org/abs/2502.01042).

	Refer to our [code repo](https://github.com/Hanpx20/SafeSwitch) for usage.

	`refusal_head.pth`: the refusal head.

	`direct_prober/`: the direct prober from the last layer.

	`stage1_prober/`: the prober to predict unsafe inputs from the last layer tokens.

	`stage2_prober/`: the prober to predict model compliance after decoding 3 tokens.

	All probers are 2-layer MLPs with an intermediate size of 64.
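
For orientation, a 2-layer MLP with an intermediate size of 64 could look roughly like the PyTorch sketch below. This is an illustration under stated assumptions, not the code from the repo: the class name `Prober`, the ReLU activation, the input width of 4096, and the two-class output are all hypothetical, and the checkpoint layout in the prober directories may differ. Refer to the code repo for the actual definitions and loading logic.

```python
import torch
import torch.nn as nn


class Prober(nn.Module):
    """Hypothetical 2-layer MLP prober (illustrative, not the repo's class)."""

    def __init__(self, input_dim: int, hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),    # layer 1: hidden state -> 64
            nn.ReLU(),                           # assumed activation
            nn.Linear(hidden_dim, num_classes),  # layer 2: 64 -> class logits
        )

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, input_dim) activations taken from the LLM
        return self.net(hidden_state)


input_dim = 4096  # assumed hidden size of the base LLM
prober = Prober(input_dim)

# Random stand-in for a batch of 3 last-layer activations.
activations = torch.randn(3, input_dim)
logits = prober(activations)
probs = torch.softmax(logits, dim=-1)  # per-class probabilities
```

With these assumed dimensions the prober stays small (two linear layers), so running it alongside generation adds little overhead compared with the base model's forward pass.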