Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Abstract
Sparse Autoencoders fail to reliably decompose neural network internals despite strong reconstruction performance, as demonstrated through evaluations on both synthetic data and real activations.
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks cast doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
Community
TL;DR: SAEs might not actually be learning meaningful features; they're barely better than random baselines.
We run two experiments:
(1) On synthetic data with known ground-truth features, SAEs achieve 71% explained variance but recover only ~9% of true features (see the recovery-metric sketch after this list).
(2) On real LLM activations, three "frozen" baselines, in which key SAE components (encoder or decoder) are randomly initialized and never trained, match fully trained SAEs on all standard evaluation metrics: interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72).
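For concreteness, here is a minimal sketch of how ground-truth recovery in a setup like (1) can be measured: a true feature counts as "recovered" if some learned decoder direction is close to it in cosine similarity. The names (`D_true`, `D_sae`) and the 0.9 threshold are illustrative assumptions, not necessarily the paper's exact protocol.

```python
import numpy as np

def recovery_rate(D_true: np.ndarray, D_sae: np.ndarray, thresh: float = 0.9) -> float:
    """Fraction of ground-truth features matched by some learned SAE feature.

    D_true: (n_true, d_model) ground-truth feature directions (known in the synthetic setup).
    D_sae:  (n_learned, d_model) learned SAE decoder directions.
    thresh: illustrative cosine-similarity cutoff for counting a feature as recovered.
    """
    Dt = D_true / np.linalg.norm(D_true, axis=1, keepdims=True)  # unit-normalize rows
    Ds = D_sae / np.linalg.norm(D_sae, axis=1, keepdims=True)
    sims = Dt @ Ds.T                     # pairwise cosine similarities (n_true, n_learned)
    best = sims.max(axis=1)              # best-matching learned feature per true feature
    return float((best >= thresh).mean())
```

High explained variance and low recovery can coexist because good reconstruction only requires the decoder to span the data, not to align individual directions with the true features.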
The core takeaway: current SAE evaluation metrics (reconstruction, interpretability scores, sparse probing, causal editing) are too weak to distinguish "learning real features" from "exploiting random structure at scale." We propose these frozen baselines as simple sanity checks that future SAE work should beat.
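As a minimal sketch of one such frozen baseline, the snippet below trains only the encoder against a random, never-updated decoder, assuming a standard ReLU SAE; the paper's exact parameterization and its other two baselines may differ.

```python
import torch
import torch.nn as nn

class FrozenDecoderSAE(nn.Module):
    """ReLU SAE whose decoder directions are random and never trained."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)   # trained as usual
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        dec = torch.randn(n_features, d_model)
        dec = dec / dec.norm(dim=1, keepdim=True)   # unit-norm random directions
        self.register_buffer("W_dec", dec)          # buffer => never seen by the optimizer

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x - self.b_dec))    # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec         # reconstruct from frozen directions
        return x_hat, f
```

If a baseline like this matches a fully trained SAE on interpretability, probing, and editing scores, those metrics cannot be certifying that training discovered meaningful feature directions, which is exactly the sanity check being proposed.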
arXivLens breakdown of this paper: https://arxivlens.com/PaperView/Details/sanity-checks-for-sparse-autoencoders-do-saes-beat-random-baselines-3004-6f57d4c4
- Executive Summary
- Detailed Breakdown
- Practical Applications