---
license: mit
tags:
- steering-vectors
- activation-steering
- llm-safety
- representation-engineering
- interpretability
library_name: rotalabs-steer
---
# Steering Vectors for LLM Behavior Control
Pre-extracted steering vectors for use with [rotalabs-steer](https://github.com/rotalabs/rotalabs-steer).
## Installation
```bash
pip install rotalabs-steer
```
## Usage
```python
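import os

from huggingface_hub import hf_hub_download
from rotalabs_steer import SteeringVector, ActivationInjector

# Download the vector tensor for layer 15
vector_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.pt",
)
# Download the matching metadata file so it sits next to the tensor in the local cache
metadata_path = hf_hub_download(
    repo_id="rotalabs/steering-vectors",
    filename="refusal_qwen3_8b/layer_15.json",
)

# Load the vector; the path is passed without its file extension
vector = SteeringVector.load(os.path.splitext(vector_path)[0])

# Apply to the model while generating.
# `model` and `inputs` are assumed to be an already loaded model and its tokenized inputs.
injector = ActivationInjector(model, [vector], strength=1.0)
with injector:
    outputs = model.generate(**inputs)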
```
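
For readers new to activation steering, the following is a minimal sketch of the general technique: a scaled vector is added to a layer's output for the duration of a forward pass, then removed. It is not taken from rotalabs-steer, and how `ActivationInjector` works internally is not documented here. The toy `nn.Linear` layer, the all-ones vector, and the `add_steering` hook are hypothetical stand-ins chosen so the example runs without downloading a model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one model layer (assumption: a real model exposes its own blocks).
layer = nn.Linear(4, 4)
steering_vector = torch.ones(4)  # hypothetical direction, not a real extracted vector
strength = 1.0

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + strength * steering_vector

x = torch.randn(2, 4)
baseline = layer(x)

handle = layer.register_forward_hook(add_steering)
try:
    steered = layer(x)
finally:
    handle.remove()  # detach the hook so later calls are unaffected

restored = layer(x)
```

Removing the hook in a `finally` block mirrors the `with injector:` pattern above: steering applies only inside the block, and the model behaves normally afterwards.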
## Available Vectors
| Behavior | Model | Layers | Description |
|----------|-------|--------|-------------|
| `refusal` | Qwen3-8B | 14-18 | Refuse harmful requests |
| `refusal` | Mistral-7B-Instruct-v0.2 | 14-18 | Refuse harmful requests |
| `refusal` | Gemma-2-9B-IT | 14-18 | Refuse harmful requests |
| `hierarchy` | Qwen3-8B | 12-22 | Follow system over user instructions |
| `hierarchy` | Mistral-7B-Instruct-v0.2 | multiple | Follow system over user instructions |
| `tool_restraint` | Mistral-7B-Instruct-v0.2 | multiple | Avoid unnecessary tool use |
| `uncertainty` | Mistral-7B-Instruct-v0.2 | multiple | Express calibrated uncertainty |
## Directory Structure
```
refusal_qwen3_8b/
├── metadata.json   # Set metadata
├── layer_14.json   # Layer 14 vector metadata
├── layer_14.pt     # Layer 14 vector tensor
├── layer_15.json
├── layer_15.pt
└── ...
```
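
Given this layout, fetching several layers of one vector set can be sketched as below. The helpers `layer_files` and `download_layers` are hypothetical names, not part of rotalabs-steer, and the sketch assumes every layer in the requested range has both a `.pt` and a `.json` file following the `layer_<N>` pattern shown above.

```python
def layer_files(vector_set: str, first_layer: int, last_layer: int) -> list[tuple[str, str]]:
    """Return (tensor, metadata) repo paths for each layer, inclusive of both ends."""
    return [
        (f"{vector_set}/layer_{n}.pt", f"{vector_set}/layer_{n}.json")
        for n in range(first_layer, last_layer + 1)
    ]

def download_layers(repo_id: str, vector_set: str, first_layer: int, last_layer: int) -> list[str]:
    """Download tensor and metadata files; return the local tensor paths."""
    # Imported lazily so layer_files can be used without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    tensor_paths = []
    for tensor_name, metadata_name in layer_files(vector_set, first_layer, last_layer):
        tensor_paths.append(hf_hub_download(repo_id=repo_id, filename=tensor_name))
        # Fetch the metadata too so it lands next to the tensor in the cache.
        hf_hub_download(repo_id=repo_id, filename=metadata_name)
    return tensor_paths

# Layers 14-18 are the range listed for `refusal` on Qwen3-8B in the table above.
pairs = layer_files("refusal_qwen3_8b", 14, 18)
```

Calling `download_layers("rotalabs/steering-vectors", "refusal_qwen3_8b", 14, 18)` would then return local tensor paths; each can have its extension stripped and be passed to `SteeringVector.load` as in the Usage section.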
## Links
- Package: [rotalabs-steer on PyPI](https://pypi.org/project/rotalabs-steer/)
- Documentation: [rotalabs.github.io/rotalabs-steer](https://rotalabs.github.io/rotalabs-steer/)
- GitHub: [github.com/rotalabs/rotalabs-steer](https://github.com/rotalabs/rotalabs-steer)
## License
MIT License