Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

OpenReview: https://openreview.net/forum?id=AaT3liS5PE

Data: https://huggingface.co/datasets/colored-dye/concept500-contrastive

Setups:

2b_l10: 10th layer of google/gemma-2-2b-it
9b_l20: 20th layer of google/gemma-2-9b-it
q25_32b_l32: 32nd layer of Qwen/Qwen2.5-32B-Instruct
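As a convenience, the setup names above can be kept in a small lookup table. The sketch below is illustrative only: the `Setup` type, the `SETUPS` dictionary, and `layer_from_name` are hypothetical names that are not part of this repository. The list above does not say whether layer numbers are 0-indexed or 1-indexed, so the values are stored exactly as written.

```python
from typing import NamedTuple


class Setup(NamedTuple):
    """Hypothetical record for one setup listed above."""
    model_id: str  # Hugging Face model identifier, as listed above
    layer: int     # layer number as written above (indexing convention not stated)


# Keys are the top-level directory names used in this repository.
SETUPS = {
    "2b_l10": Setup("google/gemma-2-2b-it", 10),
    "9b_l20": Setup("google/gemma-2-9b-it", 20),
    "q25_32b_l32": Setup("Qwen/Qwen2.5-32B-Instruct", 32),
}


def layer_from_name(setup_name: str) -> int:
    """Read the layer number from the trailing `_l<N>` part of a setup name."""
    return int(setup_name.rsplit("_l", 1)[1])


# Sanity check: the directory name and the listed layer agree for every setup.
for name, setup in SETUPS.items():
    assert layer_from_name(name) == setup.layer
```

Before using a layer number to index into a model's hidden states, it is worth confirming the indexing convention against the paper or its training code.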

Directory structure:

.
├── 2b_l10              -- setup
│   └── outputs_add_free
│       ├── all         -- full-sequence intervention
│       │   ├── lang    -- Lang. objective
│       │   │   ├── 0   -- concept 0
│       │   │   ├── 1   -- concept 1
│       │   │   ...
│       │   └── simpo   -- SimPO objective
│       └── f2+l2       -- prompt-only intervention (2 prefix tokens, 2 suffix tokens)
│           ├── lang
│           └── simpo
...
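The layout above can be navigated programmatically. The following is a minimal sketch, not the authors' code: `vector_dir`, `parse_tag`, and `prompt_positions` are hypothetical helpers. It assumes that a tag such as `f2+l2` means "first 2 and last 2 prompt tokens", which is one reading of "2 prefix tokens, 2 suffix tokens". The file names inside each concept directory are not listed here, so the sketch stops at the directory level.

```python
import re
from pathlib import PurePosixPath


def vector_dir(setup: str, intervention: str, objective: str, concept: int) -> PurePosixPath:
    """Build the repository-relative directory for one concept.

    Mirrors the tree above: <setup>/outputs_add_free/<intervention>/<objective>/<concept>
    """
    return PurePosixPath(setup) / "outputs_add_free" / intervention / objective / str(concept)


def parse_tag(intervention: str) -> tuple[int, int]:
    """Split a tag like 'f2+l2' into (prefix count, suffix count).

    Assumption: 'f' and 'l' stand for first and last prompt tokens.
    Raises ValueError for any other tag, including 'all'.
    """
    match = re.fullmatch(r"f(\d+)\+l(\d+)", intervention)
    if match is None:
        raise ValueError(f"not a prompt-only tag: {intervention!r}")
    return int(match.group(1)), int(match.group(2))


def prompt_positions(prompt_len: int, n_prefix: int, n_suffix: int) -> list[int]:
    """Return the prompt token indices covered by the prefix and suffix windows.

    Overlapping windows (short prompts) are merged, so no index repeats.
    """
    first = range(min(n_prefix, prompt_len))
    last = range(max(prompt_len - n_suffix, 0), prompt_len)
    return sorted(set(first) | set(last))


n_prefix, n_suffix = parse_tag("f2+l2")
print(vector_dir("2b_l10", "f2+l2", "simpo", 0))
print(prompt_positions(6, n_prefix, n_suffix))
```

To fetch only one setup or one concept rather than the whole repository, `huggingface_hub.snapshot_download` accepts an `allow_patterns` argument that could be given a pattern built from `vector_dir(...)`; the exact files to load afterwards depend on what each concept directory contains.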

Citation

If you find our work useful, please cite:

@inproceedings{bao2026towards,
  title = {Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions},
  author = {Bao, Yuntai and Li, Qinfeng and Yu, Xinyan and Zhang, Xuhong and Su, Ge and Zhang, Wenqi and Yan, Liu and Weng, Haiqin and Yin, Jianwei},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=AaT3liS5PE},
}

Hugging Face paper page: 2605.05983 (published May 7)