Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

OpenReview: https://openreview.net/forum?id=AaT3liS5PE

Paper: https://arxiv.org/abs/2605.05983

Data: https://huggingface.co/datasets/colored-dye/concept500-contrastive

Setups:

  • 2b_l10: 10th layer of google/gemma-2-2b-it
  • 9b_l20: 20th layer of google/gemma-2-9b-it
  • q25_32b_l32: 32nd layer of Qwen/Qwen2.5-32B-Instruct
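
For scripting over the setups above, a small lookup table can be handy. This is only a sketch: the `Setup` class, `SETUPS` dict, and `get_setup` helper are hypothetical names, not part of this repository. The layer numbers are copied as written above; whether they are 0-based or 1-based indices is not stated here, so check the paper before using them to index model layers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    model_id: str  # Hugging Face model ID as listed in the setup list
    layer: int     # layer number as written in the setup list (index base not stated)


# Hypothetical lookup table mirroring the three setups listed above.
SETUPS = {
    "2b_l10": Setup("google/gemma-2-2b-it", 10),
    "9b_l20": Setup("google/gemma-2-9b-it", 20),
    "q25_32b_l32": Setup("Qwen/Qwen2.5-32B-Instruct", 32),
}


def get_setup(name: str) -> Setup:
    """Return the setup for a directory name, or raise a readable error."""
    try:
        return SETUPS[name]
    except KeyError:
        raise ValueError(
            f"unknown setup {name!r}; expected one of {sorted(SETUPS)}"
        ) from None
```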

Directory structure:

.
├── 2b_l10              -- setup
│   └── outputs_add_free
│       ├── all         -- full-sequence intervention
│       │   ├── lang    ---- Lang. objective
│       │   │   ├── 0   ------ concept 0
│       │   │   ├── 1   ------ concept 1
│       │   │   ...
│       │   └── simpo   ---- SimPO objective
│       └── f2+l2       -- prompt-only intervention (2 prefix tokens, 2 suffix tokens)
│           ├── lang
│           └── simpo
...
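
The layout above can be navigated with a few path helpers. The sketch below is not part of this repository: `vector_dir`, `parse_intervention`, and `prompt_positions` are hypothetical names, and the reading of `f2+l2` as "first 2 and last 2 prompt tokens" is an assumption based on the comment in the tree. The files stored inside each concept directory are not documented here, so the code stops at the directory level.

```python
import re
from pathlib import Path


def vector_dir(root, setup, intervention, objective, concept_id) -> Path:
    """Build <root>/<setup>/outputs_add_free/<intervention>/<objective>/<concept>."""
    return (
        Path(root) / setup / "outputs_add_free"
        / intervention / objective / str(concept_id)
    )


def parse_intervention(name: str):
    """Return None for 'all', or (n_first, n_last) for names like 'f2+l2'.

    Assumption: 'fN' means the first N prompt tokens and 'lM' the last M.
    """
    if name == "all":
        return None
    match = re.fullmatch(r"f(\d+)\+l(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized intervention name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def prompt_positions(n_first: int, n_last: int, prompt_len: int) -> list[int]:
    """Token indices covered by a prompt-only intervention (assumed semantics)."""
    first = range(min(n_first, prompt_len))
    last = range(max(prompt_len - n_last, 0), prompt_len)
    # Merge and deduplicate in case the prompt is shorter than n_first + n_last.
    return sorted(set(first) | set(last))


if __name__ == "__main__":
    print(vector_dir("repo", "2b_l10", "f2+l2", "simpo", 0))
    print(prompt_positions(*parse_intervention("f2+l2"), prompt_len=6))
```

To fetch the files first, `huggingface_hub.snapshot_download` accepts a `repo_id` and an `allow_patterns` filter such as `"2b_l10/outputs_add_free/f2+l2/simpo/0/*"`; the local directory it returns can then be passed as `root`.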

Citation

If you find our work useful, please cite:

@inproceedings{bao2026towards,
  title = {Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions},
  author = {Bao, Yuntai and Li, Qinfeng and Yu, Xinyan and Zhang, Xuhong and Su, Ge and Zhang, Wenqi and Yan, Liu and Weng, Haiqin and Yin, Jianwei},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=AaT3liS5PE},
}