# Extending Activation Steering to Broad Skills and Multiple Behaviours
Implementation of the paper "Extending Activation Steering to Broad Skills and Multiple Behaviours" ([arXiv:2403.05767](https://arxiv.org/abs/2403.05767)) by Teun van der Weij, Massimo Poesio, and Nandi Schoots.
## Paper Summary

This paper investigates activation steering for:
- Broad skills: e.g., general coding ability vs. Python-specific ability
- Multiple behaviours: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy
Key findings:
- Steering broader skills is competitive with steering narrower skills
- Combining steering vectors into one vector is largely unsuccessful
- Injecting individual steering vectors at different layers simultaneously is promising
## Repository Structure

```
.
├── broad_steering.py       # Experiment 1: broad steering (Sections 2.1, 3.1)
├── multi_steering.py       # Experiment 2: multi-steering (Sections 2.2, 3.2)
├── run_experiments.py      # End-to-end runner for both experiments
├── test_pipeline_small.py  # Quick CPU test with GPT-2
└── README.md
```
## Requirements
- Python 3.10+
- PyTorch
- Transformers
- Datasets
- NumPy
- tqdm
```bash
pip install torch transformers datasets numpy tqdm
```
## Usage

### Quick Test (CPU, GPT-2)

Verify the pipeline works end-to-end with a tiny model:

```bash
python test_pipeline_small.py
```
### Full Experiments (GPU, Llama-2-7b)

Run both experiments with the default Llama-2-7b-chat-hf model:

```bash
python run_experiments.py --model meta-llama/Llama-2-7b-chat-hf --experiment all
```

Run only broad steering:

```bash
python run_experiments.py --experiment broad
```

Run only multi-steering:

```bash
python run_experiments.py --experiment multi
```

Run in fast test mode (small data, for debugging):

```bash
python run_experiments.py --test
```
## Experiment Details

### Experiment 1: Broad Steering
- Model: Llama-2-7b-chat-hf (bfloat16)
- Datasets:
  - Text: Pile (non-code subsets)
  - Code: Pile (GitHub, StackExchange)
  - Python: Pile (GitHub, filtered for Python code)
- Method: compute steering vectors by averaging last-token residual-stream activations over each dataset, then steer by subtracting the code vector and adding the text vector (see the sketch after this list)
- Evaluation: top-1 next-token prediction accuracy on 500k text tokens and 500k code/Python tokens
- Injection coefficients: 0.0–50.0 (see Appendix A.2)
- Layers evaluated: 0, 5, 10, 15, 20, 25, 29, 31
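A minimal sketch of the vector extraction and evaluation, assuming the Hugging Face `transformers` API; the sample lists, layer choice, and helper names below are illustrative placeholders, not the repository's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical placeholder samples; the real experiments stream these from the Pile.
text_samples = ["The history of tea spans centuries of trade and culture."]
code_samples = ["def add(a, b):\n    return a + b"]

@torch.no_grad()
def mean_last_token_activation(texts, layer):
    """Average the residual-stream activation of the final token at `layer`."""
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = model(input_ids=ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[layer]
        # is the residual stream after decoder block `layer`.
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

@torch.no_grad()
def top1_accuracy(ids):
    """Fraction of next tokens predicted correctly under greedy decoding."""
    preds = model(input_ids=ids).logits[0, :-1].argmax(dim=-1)
    return (preds == ids[0, 1:]).float().mean().item()

# Steer away from code and towards text: add the text vector, subtract the code vector.
layer = 20  # example layer, not a tuned value
steering_vector = (mean_last_token_activation(text_samples, layer)
                   - mean_last_token_activation(code_samples, layer))
```

Injection then adds `coefficient * steering_vector` to the residual stream at the chosen layer; a hook-based version of this step is sketched under Experiment 2.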
### Experiment 2: Multi-Steering
- Behaviours: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy (political)
- Dataset: Anthropic's model-written-evals
- Method: Contrastive Activation Addition (CAA), i.e. matching-answer activations minus non-matching-answer activations
- Individual steering: grid search over coefficients {0.5, 1, 2, 3, 5, 10, 20, 30, 40, 60, 80, 120, 200, 300} and layers
- Combined steering: 8 combinations of mean/sum, weighted/unweighted, add/subtract
- Simultaneous steering: inject each behaviour's vector at a different layer with a single global coefficient (see the hook sketch after this list)
- Evaluation: matching score (fraction of answers matching the target behaviour), with mode-collapse and validity checks
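The CAA vectors are mean activation differences computed analogously to the sketch above (matching-answer minus non-matching-answer activations). For the simultaneous variant, here is a minimal sketch using PyTorch forward hooks, assuming a Llama-style module layout (`model.model.layers`); the layer assignment and coefficient are hypothetical examples:

```python
import torch

def make_steering_hook(vec: torch.Tensor, coef: float):
    """Add `coef * vec` to a decoder block's output hidden states."""
    def hook(module, inputs, output):
        hidden = output[0]  # Llama decoder blocks return a tuple
        hidden = hidden + coef * vec.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + tuple(output[1:])
    return hook

def steer_simultaneously(model, vectors_by_layer: dict, coef: float):
    """Inject each behaviour's vector at its own layer, all scaled by one
    global coefficient. Returns hook handles; call .remove() on each to undo."""
    return [
        model.model.layers[idx].register_forward_hook(make_steering_hook(vec, coef))
        for idx, vec in vectors_by_layer.items()
    ]

# Hypothetical usage: one vector per behaviour, each at its own layer.
# handles = steer_simultaneously(model, {10: myopia_vec, 13: wealth_vec}, coef=5.0)
# ...greedy generation...
# for h in handles:
#     h.remove()
```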
## Key Implementation Notes
- Activation extraction: the last token's residual-stream activation is used, since causal attention concentrates the most contextual information there
- No normalization: steering vectors are not normalized, which keeps coefficients on a similar scale across layers
- No sampling: all generation is greedy, per the paper
- Permuted baselines: randomly permuting a steering vector's entries preserves its mean and standard deviation but destroys the direction it encodes (see the sketch below)
- Discard criteria: hyperparameter combinations producing >5% invalid answers or >95% mode collapse are discarded
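The permuted baseline and the discard rule are small enough to sketch directly; the function names are illustrative, not from the repository:

```python
import torch

def permuted_baseline(vec: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the entries of a steering vector: the mean and std
    are preserved, but the direction the vector encodes is destroyed."""
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vec.numel(), generator=gen)
    return vec.flatten()[perm].reshape(vec.shape)

def keep_combination(frac_invalid: float, frac_mode_collapse: float) -> bool:
    """Discard criteria: drop hyperparameter combinations with >5% invalid
    answers or >95% mode collapse."""
    return frac_invalid <= 0.05 and frac_mode_collapse <= 0.95
```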
## Results

Results are saved to `./results/`:

- `broad_steering_results.json`: relative coding/text accuracy per layer and coefficient
- `multi_steering_results.json`: individual, combined, and simultaneous steering scores
- `steering_vectors/`: saved NumPy arrays of the steering vectors
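For quick inspection, the JSON files can be loaded directly (the exact schema depends on the experiment scripts):

```python
import json

with open("results/broad_steering_results.json") as f:
    broad = json.load(f)
with open("results/multi_steering_results.json") as f:
    multi = json.load(f)

print(sorted(broad.keys()), sorted(multi.keys()))
```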
## Citation

```bibtex
@article{vanderweij2024extending,
  title={Extending Activation Steering to Broad Skills and Multiple Behaviours},
  author={van der Weij, Teun and Poesio, Massimo and Schoots, Nandi},
  journal={arXiv preprint arXiv:2403.05767},
  year={2024}
}
```
## License

The code is released under the MIT License. Datasets follow their original licenses (see the paper for details).