arxiv:2601.08441

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Published on Jan 13
Submitted by Abdelaziz Bounhar on Jan 20

AI-generated summary

YaPO learns sparse steering vectors through sparse autoencoder latent-space optimization, enabling more effective and stable control of large language model behaviors than dense methods.

Abstract

Steering Large Language Models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data in a Direct Preference Optimization (DPO) fashion, enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, limiting their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE). By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines. Beyond cultural alignment, YaPO generalizes to a range of alignment-related behaviors, including hallucination, wealth-seeking, jailbreak, and power-seeking. Importantly, YaPO preserves general knowledge, with no measurable degradation on MMLU. Overall, our results show that YaPO provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad applications to controllability and domain adaptation. The associated code and data are publicly available at https://github.com/MBZUAI-Paris/YaPO.
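The recipe the abstract describes can be illustrated with a toy numpy sketch: optimize a sparse code in the latent space of a frozen SAE decoder under a logistic, DPO-style preference loss, so the decoded steering vector stays a combination of a few SAE latents. Everything below is an illustrative assumption, not the paper's actual objective: the tiny dimensions, the stand-in preference signal `pref_dir` (a proxy for the chosen-vs-rejected log-probability margin), and the hard top-k sparsity projection.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae, k = 16, 64, 4  # toy sizes; real models and SAEs are far larger
W_dec = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_sae)  # frozen SAE decoder

# Stand-in for the preference signal: a direction in activation space that the
# "chosen" behavior correlates with (purely illustrative, not the paper's loss).
pref_dir = rng.standard_normal(d_model)

def steering_vector(z):
    """Decode a sparse SAE code into a dense steering vector."""
    return W_dec @ z

def loss_and_grad(z):
    """Logistic (DPO-style) preference loss on a toy margin."""
    v = steering_vector(z)
    margin = v @ pref_dir                  # proxy for chosen-minus-rejected score
    loss = np.log1p(np.exp(-margin))       # -log sigmoid(margin)
    grad_v = -pref_dir / (1.0 + np.exp(margin))
    return loss, W_dec.T @ grad_v          # chain rule back to the sparse code

def project_topk(z, k):
    """Keep only the k largest-magnitude latents (hard sparsity constraint)."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out

z = np.zeros(d_sae)
for _ in range(200):                       # plain projected gradient descent
    _, g = loss_and_grad(z)
    z = project_topk(z - 0.1 * g, k)

final_loss, _ = loss_and_grad(z)           # below log(2), the loss at z = 0
```

The point of the sketch is the shape of the optimization, not its details: the trainable object is the sparse code `z`, the decoder is frozen, and only the decoded vector `W_dec @ z` ever touches the model's activations.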

Community


Dense steering vectors often fail due to feature entanglement. YaPO addresses this by learning sparse steering vectors directly in a Sparse Autoencoder's latent space, trained on preference data with a DPO-style optimization loss.

Highlights:

  • Precision & Stability: Converges significantly faster and is more stable than dense baselines like BiPO.

  • Cultural Alignment: Superior performance on a new 15-culture benchmark, specifically closing the "implicit-explicit" gap where models usually struggle.

  • Generalization: Works on hallucination and jailbreaks without degrading general knowledge (MMLU).
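At inference time, a learned steering vector is applied with the standard activation-intervention recipe: add it to the hidden states at the chosen layer. A minimal sketch, with an illustrative `alpha` strength and toy shapes (not code from the paper):

```python
import numpy as np

def apply_steering(hidden, v, alpha=1.0):
    """Add a steering vector to every token position's hidden state.

    hidden: (seq_len, d_model) activations at the intervention layer
    v:      (d_model,) decoded steering vector
    alpha:  scalar strength; a negative sign steers the opposite direction
    """
    return hidden + alpha * v

hidden = np.zeros((3, 8))        # toy activations for 3 tokens
v = np.arange(8.0)               # toy steering vector
steered = apply_steering(hidden, v, alpha=0.5)
```

In a real model this addition would typically be done with a forward hook on the target layer; the sparse parameterization only changes how `v` is learned, not how it is applied.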


arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/yapo-learnable-sparse-activation-steering-vectors-for-domain-adaptation-4735-97b4f371


