Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Abstract
Sparse Autoencoders fail to reliably decompose neural network internals despite strong reconstruction performance, as demonstrated through evaluations on both synthetic data and real activations.
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks cast doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
Community
TL;DR: SAEs might not actually be learning meaningful features; they're barely better than random baselines.
We run two experiments:
(1) On synthetic data with known ground-truth features, SAEs achieve 71% explained variance but recover only ~9% of true features (see the recovery-metric sketch after this list).
(2) On real LLM activations, three "frozen" baselines, in which key SAE components (encoder or decoder) are randomly initialized and never trained, match fully trained SAEs on all standard evaluation metrics: interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72).
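For concreteness, here is a minimal sketch of how ground-truth recovery in a setup like (1) can be measured: a true feature counts as "recovered" if some learned decoder direction is close to it in cosine similarity. The names (`D_true`, `D_sae`) and the 0.9 threshold are illustrative assumptions, not necessarily the paper's exact protocol.

```python
import numpy as np

def recovery_rate(D_true: np.ndarray, D_sae: np.ndarray, thresh: float = 0.9) -> float:
    """Fraction of ground-truth features matched by some learned SAE feature.

    D_true: (n_true, d_model) ground-truth feature directions (known in the synthetic setup).
    D_sae:  (n_learned, d_model) learned SAE decoder directions.
    thresh: illustrative cosine-similarity cutoff for counting a feature as recovered.
    """
    Dt = D_true / np.linalg.norm(D_true, axis=1, keepdims=True)  # unit-normalize rows
    Ds = D_sae / np.linalg.norm(D_sae, axis=1, keepdims=True)
    sims = Dt @ Ds.T                     # pairwise cosine similarities (n_true, n_learned)
    best = sims.max(axis=1)              # best-matching learned feature per true feature
    return float((best >= thresh).mean())
```

High explained variance and low recovery can coexist because good reconstruction only requires the decoder to span the data, not to align individual directions with the true features.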
The core takeaway: current SAE evaluation metrics (reconstruction, interpretability scores, sparse probing, causal editing) are too weak to distinguish "learning real features" from "exploiting random structure at scale." We propose these frozen baselines as simple sanity checks that future SAE work should beat.
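As a minimal sketch of one such frozen baseline, the snippet below trains only the encoder against a random, never-updated decoder, assuming a standard ReLU SAE; the paper's exact parameterization and its other two baselines may differ.

```python
import torch
import torch.nn as nn

class FrozenDecoderSAE(nn.Module):
    """ReLU SAE whose decoder directions are random and never trained."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)   # trained as usual
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        dec = torch.randn(n_features, d_model)
        dec = dec / dec.norm(dim=1, keepdim=True)   # unit-norm random directions
        self.register_buffer("W_dec", dec)          # buffer => never seen by the optimizer

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.enc(x - self.b_dec))    # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec         # reconstruct from frozen directions
        return x_hat, f
```

If a baseline like this matches a fully trained SAE on interpretability, probing, and editing scores, those metrics cannot be certifying that training discovered meaningful feature directions, which is exactly the sanity check being proposed.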
arXivLens breakdown of this paper: https://arxivlens.com/PaperView/Details/sanity-checks-for-sparse-autoencoders-do-saes-beat-random-baselines-3004-6f57d4c4
- Executive Summary
- Detailed Breakdown
- Practical Applications