---
license: mit
tags:
- steering-vectors
- activation-steering
- llm-safety
- representation-engineering
- interpretability
library_name: rotalabs-steer
---

# Steering Vectors for LLM Behavior Control

Pre-extracted steering vectors for use with [rotalabs-steer](https://github.com/rotalabs/rotalabs-steer).

## Installation

```bash
pip install rotalabs-steer
```

## Usage

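```python
import os

from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

from rotalabs_steer import SteeringVector, ActivationInjector

# Download a vector and its metadata (both land in the same cache directory)
vector_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.pt",
)
metadata_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.json",
)

# Load and use: pass the shared path without the ".pt" extension
vector = SteeringVector.load(os.path.splitext(vector_path)[0])

# Load the matching base model and tokenize a prompt.
# Assumption: "Qwen/Qwen3-8B" is the checkpoint these vectors were extracted from.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
inputs = tokenizer("Your prompt here", return_tensors="pt")

# Apply to model: the injector is used as a context manager around generation
injector = ActivationInjector(model, [vector], strength=1.0)
with injector:
    outputs = model.generate(**inputs)
```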
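
For intuition about the `with injector:` block: activation steering is usually described as adding a scaled direction vector to a layer's hidden states during the forward pass, and removing that modification afterwards. The following is a conceptual, pure-Python sketch of that idea under those assumptions. It is not the rotalabs-steer implementation, and `ToyLayer` and `steer` are hypothetical names introduced only for illustration.

```python
from contextlib import contextmanager


class ToyLayer:
    """Stand-in for one transformer layer: returns its input unchanged."""

    def __init__(self):
        self.hooks = []  # callables applied to the layer output, in order

    def __call__(self, hidden):
        for hook in self.hooks:
            hidden = hook(hidden)
        return hidden


@contextmanager
def steer(layer, vector, strength=1.0):
    """Add strength * vector to the layer output while the block is active."""

    def hook(hidden):
        return [h + strength * v for h, v in zip(hidden, vector)]

    layer.hooks.append(hook)
    try:
        yield layer
    finally:
        layer.hooks.remove(hook)  # always restore the unsteered behavior


layer = ToyLayer()
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

with steer(layer, direction, strength=2.0):
    steered = layer(hidden)   # hidden + 2.0 * direction
restored = layer(hidden)      # hook removed, output is unmodified again
```

In this sketch the `strength` argument plays the same role as `strength=1.0` in the usage example: it scales how far the hidden state is pushed along the direction.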

## Available Vectors

| Behavior | Model | Layers | Description |
|----------|-------|--------|-------------|
| `refusal` | Qwen3-8B | 14-18 | Refuse harmful requests |
| `refusal` | Mistral-7B-Instruct-v0.2 | 14-18 | Refuse harmful requests |
| `refusal` | Gemma-2-9B-IT | 14-18 | Refuse harmful requests |
| `hierarchy` | Qwen3-8B | 12-22 | Follow system over user instructions |
| `hierarchy` | Mistral-7B-Instruct-v0.2 | multiple | Follow system over user instructions |
| `tool_restraint` | Mistral-7B-Instruct-v0.2 | multiple | Avoid unnecessary tool use |
| `uncertainty` | Mistral-7B-Instruct-v0.2 | multiple | Express calibrated uncertainty |
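
The usage example downloads `refusal_qwen3_8b/layer_15.pt`, which suggests a `<behavior>_<model_slug>/layer_<N>` naming scheme. The helper below is a small sketch built on that assumption; `vector_files` is a hypothetical name, and only the `qwen3_8b` slug is confirmed by this README, so folder names for the other models should be checked against the repository's file list.

```python
def vector_files(behavior, model_slug, layer):
    """Return the (tensor, metadata) repo paths for one layer of a vector set.

    Assumes the "<behavior>_<model_slug>/layer_<N>" layout inferred from
    "refusal_qwen3_8b/layer_15.pt".
    """
    if layer < 0:
        raise ValueError("layer must be non-negative")
    folder = f"{behavior}_{model_slug}"
    return f"{folder}/layer_{layer}.pt", f"{folder}/layer_{layer}.json"


# The table lists layers 14-18 for `refusal` on Qwen3-8B
refusal_layers = {
    layer: vector_files("refusal", "qwen3_8b", layer) for layer in range(14, 19)
}
```

Each returned pair can be passed as the `filename` argument of the two `hf_hub_download` calls shown in the usage example.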

## Directory Structure

```
refusal_qwen3_8b/
├── metadata.json      # Metadata for the whole vector set
├── layer_14.json      # Layer 14 vector metadata
├── layer_14.pt        # Layer 14 vector tensor
├── layer_15.json
├── layer_15.pt
└── ...
```
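
Given this layout, the layers available for each vector set can be recovered from a plain list of repository file paths. The sketch below works on a hard-coded sample list that mirrors the tree above; `layers_by_set` is a hypothetical helper, and the sample paths are illustrative rather than a real listing. A real listing could be obtained with `huggingface_hub.list_repo_files("rotalabs/steering-vectors")`.

```python
import re
from collections import defaultdict

# Matches "<set_folder>/layer_<N>.pt" and captures the folder and layer index
LAYER_FILE = re.compile(r"^(?P<set>[^/]+)/layer_(?P<layer>\d+)\.pt$")


def layers_by_set(repo_files):
    """Map each vector-set folder to the sorted layer indices that have a .pt file."""
    found = defaultdict(set)
    for path in repo_files:
        match = LAYER_FILE.match(path)
        if match:
            found[match.group("set")].add(int(match.group("layer")))
    return {name: sorted(layers) for name, layers in found.items()}


# Illustrative sample mirroring the directory tree (not a real repo listing)
sample = [
    "README.md",
    "refusal_qwen3_8b/metadata.json",
    "refusal_qwen3_8b/layer_14.json",
    "refusal_qwen3_8b/layer_14.pt",
    "refusal_qwen3_8b/layer_15.json",
    "refusal_qwen3_8b/layer_15.pt",
]
available = layers_by_set(sample)
```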

## Links

- Package: [rotalabs-steer on PyPI](https://pypi.org/project/rotalabs-steer/)
- Documentation: [rotalabs.github.io/rotalabs-steer](https://rotalabs.github.io/rotalabs-steer/)
- GitHub: [github.com/rotalabs/rotalabs-steer](https://github.com/rotalabs/rotalabs-steer)

## License

MIT License