arxiv:2606.06735

A Geometric Account of Activation Steering through Angle-Norm Decomposition

Published on Jun 4 · Submitted by Georgii Aparin on Jun 9
Authors:

Abstract

This work revisits the assumption that hidden-state norm carries no concept-relevant information in language models. It shows that concepts are represented primarily in angular structure, while norm remains important for the stability and downstream effects of steering across seven models.

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.

Community


A Geometric Account of Activation Steering through Angle–Norm Decomposition

We study activation steering, the family of interventions that modify hidden states to control language model behavior, as a geometric operation on representation space. Standard additive steering applies a scalar-weighted concept direction, which simultaneously alters two distinct properties of the hidden state: its angular alignment with the concept direction and its norm. We disentangle these effects through a controlled decomposition, and ask which is responsible for semantic control and which governs downstream stability.
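The coupling described above can be seen in a minimal NumPy sketch. It assumes additive steering has the form h' = h + alpha * s with a unit-norm concept direction s; the names `angle_and_norm`, `additive_steer`, and `alpha` are illustrative and not taken from the paper.

```python
import numpy as np

def angle_and_norm(h, s):
    """Return (angle in radians between h and the unit vector s, norm of h)."""
    r = np.linalg.norm(h)
    cos = np.clip(np.dot(h, s) / r, -1.0, 1.0)
    return float(np.arccos(cos)), float(r)

def additive_steer(h, s, alpha):
    """Assumed form of additive steering: add a scaled concept direction."""
    return h + alpha * s

# Toy data: a random hidden state and a random unit concept direction.
rng = np.random.default_rng(0)
d = 16
s = rng.normal(size=d)
s /= np.linalg.norm(s)
h = rng.normal(size=d)

# A single coefficient alpha moves both the angle and the norm at once.
for alpha in (0.0, 2.0, 8.0):
    theta, r = angle_and_norm(additive_steer(h, s, alpha), s)
    print(f"alpha={alpha:4.1f}  angle={theta:.3f} rad  norm={r:.3f}")
```

Because only the component along s grows while the orthogonal component stays fixed, the angle to s shrinks as alpha increases, and the norm changes along with it. One coefficient therefore cannot set the two quantities independently.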

Each hidden state is parameterized by its norm r and unit direction u, with u further decomposed into a component along the concept direction s and an orthogonal residual v. This yields a two-parameter family of interventions over an angular target γ and a radial scale β, and situates six steering methods (additive CAA, renormalized CAA-r, matched CAA-m, additive AS, spherical S and a norm-scaled variant SN) within a single framework that controls for the geometric content of the intervention.
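A minimal sketch of this parameterization follows. It assumes that γ is an absolute target angle (in radians) between the steered state and s, and that β multiplies the original norm; the paper's exact definitions may differ, and the function names `decompose`, `steer`, and `implied_gamma_beta` are hypothetical.

```python
import numpy as np

def decompose(h, s):
    """Split h into norm r, angle theta to s, and unit residual v orthogonal to s.

    Assumes h is not parallel to s, so the residual is nonzero.
    """
    s = s / np.linalg.norm(s)
    r = float(np.linalg.norm(h))
    u = h / r
    c = float(np.dot(u, s))                 # cos(theta)
    resid = u - c * s
    v = resid / np.linalg.norm(resid)
    theta = float(np.arccos(np.clip(c, -1.0, 1.0)))
    return r, theta, v, s

def steer(h, s, gamma, beta=1.0):
    """Assumed two-parameter intervention: set the angle to gamma, scale the norm by beta."""
    r, _, v, s = decompose(h, s)
    u_new = np.cos(gamma) * s + np.sin(gamma) * v
    return beta * r * u_new

def implied_gamma_beta(h, s, alpha):
    """Express additive steering h + alpha * s as an equivalent (gamma, beta) pair."""
    r, _, _, s = decompose(h, s)
    r_new, gamma, _, _ = decompose(h + alpha * s, s)
    return gamma, r_new / r

h = np.array([3.0, 4.0, 0.0])
s = np.array([1.0, 0.0, 0.0])
gamma, beta = implied_gamma_beta(h, s, alpha=2.0)
print(gamma, beta)
print(steer(h, s, gamma, beta))   # same point as h + 2.0 * s
print(steer(h, s, gamma, 1.0))    # same angle, original norm preserved
```

Under these assumptions, an additive step corresponds to one particular (γ, β) pair, while a norm-preserving spherical step corresponds to β = 1. Writing both in the same coordinates is what allows the angular target to be held fixed while the radial scale varies.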

Across seven decoder-only language models (1B to 70B parameters) and four concept datasets, we find that concept-discriminative information is encoded almost entirely in activation direction: linear probes on unit-normalized hidden states match probes on raw states, while norm-only probes remain near chance. This supports the angular hypothesis motivating recent spherical methods. However, holding the angular target γ fixed and varying only the radial scale β shows that norm is not semantically inert. At high steering strengths, strict norm preservation produces substantial increases in perplexity and losses in downstream capability, while a modest radial increase recovers much of this stability with negligible effect on the concept metric.
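The probing comparison can be sketched on synthetic data as follows. This is only an illustration of the protocol, not a reproduction of the paper's results: the data are constructed so that class information lives in direction and the norm is drawn independently of the label, and a difference-of-means linear probe stands in for whatever probe the authors trained. All names (`make_data`, `fit_mean_probe`, `features`, `probe_accuracies`) are hypothetical.

```python
import numpy as np

def make_data(n, d, rng, strength=3.0):
    """Toy hidden states: class shifts the direction, norm is label-independent."""
    y = rng.integers(0, 2, size=n)
    s = np.zeros(d)
    s[0] = 1.0                                        # toy concept direction
    signs = 2.0 * y - 1.0
    x = signs[:, None] * strength * s + rng.normal(size=(n, d))
    u = x / np.linalg.norm(x, axis=1, keepdims=True)
    r = rng.lognormal(mean=2.0, sigma=0.5, size=n)    # norm, independent of y
    return r[:, None] * u, y

def fit_mean_probe(X, y):
    """Difference-of-means linear probe with a midpoint threshold."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -0.5 * np.dot(w, mu1 + mu0)
    return w, b

def accuracy(X, y, probe):
    w, b = probe
    return float(np.mean((X @ w + b > 0).astype(int) == y))

def features(H, kind):
    """Select raw states, unit-normalized states, or the norm alone."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    if kind == "raw":
        return H
    if kind == "unit":
        return H / norms
    if kind == "norm":
        return norms
    raise ValueError(kind)

def probe_accuracies(seed=0, n=2000, d=32):
    rng = np.random.default_rng(seed)
    H_train, y_train = make_data(n, d, rng)
    H_test, y_test = make_data(n, d, rng)
    out = {}
    for kind in ("raw", "unit", "norm"):
        probe = fit_mean_probe(features(H_train, kind), y_train)
        out[kind] = accuracy(features(H_test, kind), y_test, probe)
    return out

print(probe_accuracies())
```

On real activations the three feature sets would be computed from hidden states extracted at a chosen layer, and the interesting question is whether the unit-normalized probe matches the raw probe while the norm-only probe stays near chance.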

These results suggest that activation steering should be parameterized neither by a single additive coefficient nor by an angular operation under strict norm preservation, but as a two-parameter intervention in which angular and radial components are controlled independently. We further hypothesize that hidden-state norm corresponds, in part, to the effective representational capacity available at a token: under strong angular intervention, a modest expansion of this capacity relieves competition between the steered concept and other context-relevant features, accounting for the observed stability gain.

Paper: https://arxiv.org/abs/2606.06735

