
Extending Activation Steering to Broad Skills and Multiple Behaviours

Implementation of the paper: "Extending Activation Steering to Broad Skills and Multiple Behaviours" (arXiv:2403.05767)

by Teun van der Weij, Massimo Poesio, Nandi Schoots.

Paper Summary

This paper investigates activation steering for:

  1. Broad skills: e.g., general coding ability vs. Python-specific ability
  2. Multiple behaviours: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy

Key findings:

  • Steering broader skills is competitive with steering narrower skills
  • Combining steering vectors into one vector is largely unsuccessful
  • Injecting individual steering vectors at different layers simultaneously is promising

Repository Structure

.
├── broad_steering.py       # Experiment 1: broad steering (Sections 2.1, 3.1)
├── multi_steering.py       # Experiment 2: multi-steering (Sections 2.2, 3.2)
├── run_experiments.py      # End-to-end runner for both experiments
├── test_pipeline_small.py  # Quick CPU test with GPT-2
└── README.md

Requirements

  • Python 3.10+
  • PyTorch
  • Transformers
  • Datasets
  • NumPy
  • tqdm
pip install torch transformers datasets numpy tqdm

Usage

Quick Test (CPU, GPT-2)

Verify the pipeline works end-to-end with a tiny model:

python test_pipeline_small.py

Full Experiments (GPU, Llama-2-7b)

Run both experiments with the default Llama-2-7b-chat-hf model:

python run_experiments.py --model meta-llama/Llama-2-7b-chat-hf --experiment all

Run only broad steering:

python run_experiments.py --experiment broad

Run only multi-steering:

python run_experiments.py --experiment multi

Run in fast test mode (small data, for debugging):

python run_experiments.py --test

Experiment Details

Experiment 1: Broad Steering

  • Model: Llama-2-7b-chat-hf (bfloat16)
  • Datasets:
    • Text: Pile (non-code subsets)
    • Code: Pile (Github, StackExchange)
    • Python: Pile (Github filtered for Python code)
  • Method: Compute steering vectors by averaging last-token activations; steer by subtracting the code vector and adding the text vector
  • Evaluation: Top-1 next token prediction accuracy on 500k text tokens and 500k code/Python tokens
  • Injection coefficients: 0.0–50.0 (see Appendix A.2)
  • Layers evaluated: 0, 5, 10, 15, 20, 25, 29, 31
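
The vector construction above can be sketched with toy tensors standing in for last-token activations collected at one layer (all names, dimensions, and the coefficient here are illustrative, not the repository's actual API):

```python
import torch

def mean_last_token_vector(activations: torch.Tensor) -> torch.Tensor:
    """Average last-token residual-stream activations over a batch of prompts.

    activations: [n_prompts, d_model] tensor collected at a single layer.
    """
    return activations.mean(dim=0)

# Toy stand-ins for activations from "text" and "code" prompts (d_model = 64).
torch.manual_seed(0)
text_acts = torch.randn(200, 64) + 0.5
code_acts = torch.randn(200, 64) - 0.5

# Steer away from code and toward text: +text vector, -code vector.
steering = mean_last_token_vector(text_acts) - mean_last_token_vector(code_acts)

# Inject into the residual stream with a coefficient (the paper sweeps 0-50).
coeff = 5.0
steered_residual = code_acts + coeff * steering
```

The vector is deliberately left unnormalized, matching the implementation note below that similar coefficients can then be used across layers.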

Experiment 2: Multi-Steering

  • Behaviors: myopia, wealth-seeking, agreeableness, anti-immigration, sycophancy (political)
  • Dataset: Anthropic's model-written-evals
  • Method: Contrastive Activation Addition (CAA): matching-answer activations minus non-matching-answer activations
  • Individual steering: Grid search over coefficients {0.5, 1, 2, 3, 5, 10, 20, 30, 40, 60, 80, 120, 200, 300} and layers
  • Combined steering: 8 combinations of mean/sum, weighted/unweighted, add/subtract
  • Simultaneous steering: Inject each behavior's vector at a different layer with a global coefficient
  • Evaluation: Matching score (fraction of answers matching target behavior), with mode-collapse and validity checks
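
Simultaneous steering (one behaviour's vector at one layer, all injected in a single forward pass) can be sketched with PyTorch forward hooks on a toy residual stack; the module and function names here are illustrative, not the script's real interface:

```python
import torch
import torch.nn as nn

class ToyResidualStack(nn.Module):
    """Minimal stand-in for a transformer's residual stream."""
    def __init__(self, n_layers: int = 4, d_model: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + layer(x)  # residual connection
        return x

def inject_vectors(model, vectors_by_layer, coeff):
    """Add coeff * vector to each chosen layer's output (one behaviour per layer)."""
    handles = []
    for layer_idx, vec in vectors_by_layer.items():
        def hook(_module, _inputs, output, vec=vec):
            return output + coeff * vec  # returned value replaces the layer output
        handles.append(model.layers[layer_idx].register_forward_hook(hook))
    return handles

torch.manual_seed(0)
model = ToyResidualStack()
x = torch.randn(1, 16)
baseline = model(x)

# e.g. a "myopia" vector at layer 1 and a "sycophancy" vector at layer 3,
# scaled by one global coefficient.
vectors = {1: torch.randn(16), 3: torch.randn(16)}
handles = inject_vectors(model, vectors, coeff=2.0)
steered = model(x)
for h in handles:
    h.remove()  # restore the unsteered model
```

Removing the hooks restores the original behaviour, so the same model instance can be reused across the coefficient grid search.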

Key Implementation Notes

  1. Activation extraction: Last token in the residual stream (most contextual information due to causal attention)
  2. No normalization: Steering vectors are NOT normalized, allowing similar coefficients across layers
  3. No sampling: All generation is greedy (no sampling) per the paper
  4. Permuted baselines: Permuted steering vectors maintain mean/std but distort activation order
  5. Discard criteria: hyperparameter combinations with >5% invalid answers or >95% mode collapse are discarded
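
Note 4 (permuted baselines) can be illustrated directly: shuffling a vector's entries leaves the summary statistics unchanged while scrambling its direction (variable names here are illustrative):

```python
import torch

torch.manual_seed(0)
steering = torch.randn(64)               # a toy steering vector
permuted = steering[torch.randperm(64)]  # permuted baseline

# Same mean and std (the multiset of values is unchanged)...
same_stats = bool(
    torch.isclose(permuted.mean(), steering.mean())
    and torch.isclose(permuted.std(), steering.std())
)
# ...but the direction is scrambled (cosine similarity near zero in expectation).
cos = torch.nn.functional.cosine_similarity(steering, permuted, dim=0)
```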

Results

Results are saved to ./results/:

  • broad_steering_results.json – relative coding/text accuracy per layer and coefficient
  • multi_steering_results.json – individual, combined, and simultaneous steering scores
  • steering_vectors/ β€” saved numpy arrays of steering vectors

Citation

@article{vanderweij2024extending,
  title={Extending Activation Steering to Broad Skills and Multiple Behaviours},
  author={van der Weij, Teun and Poesio, Massimo and Schoots, Nandi},
  journal={arXiv preprint arXiv:2403.05767},
  year={2024}
}

License

Code is released under the MIT License. Datasets follow their original licenses (see the paper for details).
