# Extending Activation Steering to Broad Skills and Multiple Behaviours
Implementation of the paper "Extending Activation Steering to Broad Skills and Multiple Behaviours" ([arXiv:2403.05767](https://arxiv.org/abs/2403.05767)) by Teun van der Weij, Massimo Poesio, and Nandi Schoots.
## Paper Summary

This paper investigates activation steering for:
- Broad skills: e.g., general coding ability vs. Python-specific ability
- Multiple behaviours: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy
Key findings:
- Steering broader skills is competitive with steering narrower skills
- Combining steering vectors into one vector is largely unsuccessful
- Injecting individual steering vectors at different layers simultaneously is promising
## Repository Structure

```
.
├── broad_steering.py       # Experiment 1: broad steering (Sections 2.1, 3.1)
├── multi_steering.py       # Experiment 2: multi-steering (Sections 2.2, 3.2)
├── run_experiments.py      # End-to-end runner for both experiments
├── test_pipeline_small.py  # Quick CPU test with GPT-2
└── README.md
```
## Requirements
- Python 3.10+
- PyTorch
- Transformers
- Datasets
- NumPy
- tqdm
```bash
pip install torch transformers datasets numpy tqdm
```
## Usage

### Quick Test (CPU, GPT-2)

Verify the pipeline works end-to-end with a tiny model:

```bash
python test_pipeline_small.py
```
### Full Experiments (GPU, Llama-2-7b)

Run both experiments with the default Llama-2-7b-chat-hf model:

```bash
python run_experiments.py --model meta-llama/Llama-2-7b-chat-hf --experiment all
```

Run only broad steering:

```bash
python run_experiments.py --experiment broad
```

Run only multi-steering:

```bash
python run_experiments.py --experiment multi
```

Run in fast test mode (small data, for debugging):

```bash
python run_experiments.py --test
```
## Experiment Details

### Experiment 1: Broad Steering
- Model: Llama-2-7b-chat-hf (bfloat16)
- Datasets:
  - Text: Pile (non-code subsets)
  - Code: Pile (GitHub, StackExchange)
  - Python: Pile (GitHub, filtered for Python code)
- Method: compute steering vectors by averaging last-token residual-stream activations over each dataset, then steer by subtracting the code vector and adding the text vector (see the sketch after this list)
- Evaluation: top-1 next-token prediction accuracy on 500k text tokens and 500k code/Python tokens
- Injection coefficients: 0.0–50.0 (see Appendix A.2)
- Layers evaluated: 0, 5, 10, 15, 20, 25, 29, 31
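A minimal sketch of the vector extraction and evaluation, assuming the Hugging Face `transformers` API; the sample lists, layer choice, and helper names below are illustrative placeholders, not the repository's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# Hypothetical placeholder samples; the real experiments stream these from the Pile.
text_samples = ["The history of tea spans centuries of trade and culture."]
code_samples = ["def add(a, b):\n    return a + b"]

@torch.no_grad()
def mean_last_token_activation(texts, layer):
    """Average the residual-stream activation of the final token at `layer`."""
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = model(input_ids=ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[layer]
        # is the residual stream after decoder block `layer`.
        acts.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(acts).mean(dim=0)

@torch.no_grad()
def top1_accuracy(ids):
    """Fraction of next tokens predicted correctly under greedy decoding."""
    preds = model(input_ids=ids).logits[0, :-1].argmax(dim=-1)
    return (preds == ids[0, 1:]).float().mean().item()

# Steer away from code and towards text: add the text vector, subtract the code vector.
layer = 20  # example layer, not a tuned value
steering_vector = (mean_last_token_activation(text_samples, layer)
                   - mean_last_token_activation(code_samples, layer))
```

Injection then adds `coefficient * steering_vector` to the residual stream at the chosen layer; a hook-based version of this step is sketched under Experiment 2.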
### Experiment 2: Multi-Steering
- Behaviours: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy (political)
- Dataset: Anthropic's model-written-evals
- Method: Contrastive Activation Addition (CAA), i.e. matching-answer activations minus non-matching-answer activations
- Individual steering: grid search over coefficients {0.5, 1, 2, 3, 5, 10, 20, 30, 40, 60, 80, 120, 200, 300} and layers
- Combined steering: 8 combinations of mean/sum, weighted/unweighted, add/subtract
- Simultaneous steering: inject each behaviour's vector at a different layer with a single global coefficient (see the hook sketch after this list)
- Evaluation: matching score (fraction of answers matching the target behaviour), with mode-collapse and validity checks
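The CAA vectors are mean activation differences computed analogously to the sketch above (matching-answer minus non-matching-answer activations). For the simultaneous variant, here is a minimal sketch using PyTorch forward hooks, assuming a Llama-style module layout (`model.model.layers`); the layer assignment and coefficient are hypothetical examples:

```python
import torch

def make_steering_hook(vec: torch.Tensor, coef: float):
    """Add `coef * vec` to a decoder block's output hidden states."""
    def hook(module, inputs, output):
        hidden = output[0]  # Llama decoder blocks return a tuple
        hidden = hidden + coef * vec.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + tuple(output[1:])
    return hook

def steer_simultaneously(model, vectors_by_layer: dict, coef: float):
    """Inject each behaviour's vector at its own layer, all scaled by one
    global coefficient. Returns hook handles; call .remove() on each to undo."""
    return [
        model.model.layers[idx].register_forward_hook(make_steering_hook(vec, coef))
        for idx, vec in vectors_by_layer.items()
    ]

# Hypothetical usage: one vector per behaviour, each at its own layer.
# handles = steer_simultaneously(model, {10: myopia_vec, 13: wealth_vec}, coef=5.0)
# ...greedy generation...
# for h in handles:
#     h.remove()
```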
## Key Implementation Notes
- Activation extraction: the last token's residual-stream activation is used, since causal attention concentrates the most contextual information there
- No normalization: steering vectors are not normalized, which keeps coefficients on a similar scale across layers
- No sampling: all generation is greedy, per the paper
- Permuted baselines: randomly permuting a steering vector's entries preserves its mean and standard deviation but destroys the direction it encodes (see the sketch below)
- Discard criteria: hyperparameter combinations producing >5% invalid answers or >95% mode collapse are discarded
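The permuted baseline and the discard rule are small enough to sketch directly; the function names are illustrative, not from the repository:

```python
import torch

def permuted_baseline(vec: torch.Tensor, seed: int = 0) -> torch.Tensor:
    """Randomly permute the entries of a steering vector: the mean and std
    are preserved, but the direction the vector encodes is destroyed."""
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vec.numel(), generator=gen)
    return vec.flatten()[perm].reshape(vec.shape)

def keep_combination(frac_invalid: float, frac_mode_collapse: float) -> bool:
    """Discard criteria: drop hyperparameter combinations with >5% invalid
    answers or >95% mode collapse."""
    return frac_invalid <= 0.05 and frac_mode_collapse <= 0.95
```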
## Results

Results are saved to `./results/`:

- `broad_steering_results.json`: relative coding/text accuracy per layer and coefficient
- `multi_steering_results.json`: individual, combined, and simultaneous steering scores
- `steering_vectors/`: saved NumPy arrays of the steering vectors
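For quick inspection, the JSON files can be loaded directly (the exact schema depends on the experiment scripts):

```python
import json

with open("results/broad_steering_results.json") as f:
    broad = json.load(f)
with open("results/multi_steering_results.json") as f:
    multi = json.load(f)

print(sorted(broad.keys()), sorted(multi.keys()))
```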
## Citation

```bibtex
@article{vanderweij2024extending,
  title={Extending Activation Steering to Broad Skills and Multiple Behaviours},
  author={van der Weij, Teun and Poesio, Massimo and Schoots, Nandi},
  journal={arXiv preprint arXiv:2403.05767},
  year={2024}
}
```
## License

The code is released under the MIT License. Datasets follow their original licenses (see the paper for details).