arxiv:2605.08149

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

Published on May 3

Abstract

AI-generated summary

Feature rivalry in sparse autoencoders serves as a mechanistic signature of model uncertainty, showing stronger rivalry in high-entropy questions and correlating with output changes and answer correctness.

Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry (negatively correlated SAE feature pairs) and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry than low-entropy questions at layers 0 and 12 (p = 5.3 × 10^-26 and p = 5.8 × 10^-5, respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes, and find that steering along the rivalry direction (vec_A - vec_B) causes more output changes than steering along random directions at low steering multipliers for 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC = 0.689), approaching but not matching softmax confidence (AUROC = 0.808).
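The abstract does not give the exact formula for the per-prompt rivalry score or the steering procedure, so the following is only a minimal NumPy sketch under stated assumptions. It assumes the score is the mean magnitude of the negative pairwise cosine similarities among the decoder vectors of the active features, and that steering adds a multiple of the unit-normalized direction vec_A - vec_B to a residual-stream vector. The names `rivalry_score`, `steer_along_rivalry`, `decoder`, and `active` are hypothetical and are not taken from the paper or from Gemma Scope.

```python
import numpy as np


def rivalry_score(decoder: np.ndarray, active: np.ndarray) -> float:
    """Assumed score: mean of max(0, -cos) over pairs of active decoder vectors.

    decoder: (n_features, d_model) array of SAE decoder vectors (assumed nonzero).
    active:  integer indices of the features active on one prompt.
    """
    vecs = decoder[active]
    if len(vecs) < 2:
        return 0.0  # no pairs, so no rivalry under this definition
    # Normalize rows so the Gram matrix holds cosine similarities.
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    cos = unit @ unit.T
    # Keep each unordered pair once (strict upper triangle).
    pairwise = cos[np.triu_indices(len(vecs), k=1)]
    # Only negatively aligned pairs contribute (assumption).
    return float(np.mean(np.clip(-pairwise, 0.0, None)))


def steer_along_rivalry(resid: np.ndarray, vec_a: np.ndarray,
                        vec_b: np.ndarray, multiplier: float) -> np.ndarray:
    """Add multiplier * unit(vec_a - vec_b) to a residual-stream vector.

    Normalizing the direction is an assumption made here so that the
    multiplier is comparable across feature pairs.
    """
    direction = vec_a - vec_b
    direction = direction / np.linalg.norm(direction)
    return resid + multiplier * direction


# Toy decoder with two opposed features and one orthogonal feature.
toy_decoder = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
score_opposed = rivalry_score(toy_decoder, np.array([0, 1]))
score_orthogonal = rivalry_score(toy_decoder, np.array([0, 2]))
steered = steer_along_rivalry(np.zeros(2), toy_decoder[0], toy_decoder[1], 2.0)
```

In this toy setup, the opposed pair gets the maximum score and the orthogonal pair gets zero. A comparison against random directions, as described in the abstract, would reuse the same steering function with a random unit vector in place of the rivalry direction.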
