Audio-SAE — Whisper-small
BatchTop-K Sparse Autoencoders trained on every encoder layer of openai/whisper-small, from the paper *AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders* (EACL 2026).
Each SAE decomposes the residual stream at one encoder layer into a sparse, largely interpretable dictionary of features.
- Code: https://github.com/audiosae/audiosae_demo
- Paper: https://arxiv.org/abs/2602.05027
- Collection: https://huggingface.co/collections/Egorgij21/audio-sae
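
To make the decomposition concrete: each SAE maps a 768-dimensional residual-stream vector to 6144 feature activations, keeps only the strongest ones, and reconstructs the input from that sparse code. The sketch below is illustrative only, not the repo's implementation (the real module is `BatchTopKSAE` from the `audio_sae` package, which applies the top-k batch-wide during training; see Training below); parameter names mirror the tensors stored in `ae.pt`.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Illustrative sketch of one Audio-SAE layer, not the repo's implementation."""

    def __init__(self, d_model=768, dict_size=6144, k=50):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)               # encoder.{weight,bias}
        self.decoder = nn.Linear(dict_size, d_model, bias=False)   # decoder.weight
        self.b_dec = nn.Parameter(torch.zeros(d_model))            # b_dec
        self.k = k

    def forward(self, x):                                  # x: (N, 768)
        pre = torch.relu(self.encoder(x - self.b_dec))     # (N, 6144) candidate features
        # Keep the k largest activations per vector, zero the rest
        # (per-vector top-k here for simplicity; training uses a batch-wide top-k).
        vals, idx = pre.topk(self.k, dim=-1)
        f = torch.zeros_like(pre).scatter_(-1, idx, vals)  # sparse feature vector
        x_hat = self.decoder(f) + self.b_dec               # reconstruction of x
        return f, x_hat
```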
Specs
| Backbone | Activation dim | Dict size | Expansion | k | Layers |
|---|---|---|---|---|---|
| Whisper-small | 768 | 6144 | 8× | 50 | 12 |
One SAE per encoder layer (layer_1 … layer_12). Layer indices are 1-based and
correspond to the output of the n-th transformer block.
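
To fetch all twelve dictionaries at once, one option (a sketch, using the repo id from the Loading section below) is to download each layer's checkpoint in a loop, or simply snapshot the whole repository:

```python
from huggingface_hub import hf_hub_download, snapshot_download

repo_id = "Egorgij21/Audio-SAE-Whisper-small"

# Per-layer download (layer_1 ... layer_12, 1-based indices)
paths = {
    layer: hf_hub_download(repo_id=repo_id, filename=f"layer_{layer}/ae.pt")
    for layer in range(1, 13)
}

# Or grab every file (all ae.pt / config.json) in one call
local_dir = snapshot_download(repo_id=repo_id)
```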
Layout
```
layer_1/
    ae.pt        # BatchTopKSAE state_dict
    config.json  # training config (activation_dim, dict_size, k, …)
layer_2/
…
layer_12/
```

Each `ae.pt` contains `encoder.{weight,bias}`, `decoder.weight`, `b_dec`, and `k`.
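
To inspect a checkpoint before wiring it into a model, you can pull one layer's files and print their contents. This is a minimal sketch using the repo id and file names above; it assumes `config.json` is plain JSON and that `ae.pt` loads as a flat state dict.

```python
import json

import torch
from huggingface_hub import hf_hub_download

repo_id = "Egorgij21/Audio-SAE-Whisper-small"
layer = 6

# Training config: activation_dim, dict_size, k, ...
cfg_path = hf_hub_download(repo_id=repo_id, filename=f"layer_{layer}/config.json")
with open(cfg_path) as fh:
    print(json.load(fh))

# State dict: encoder.{weight,bias}, decoder.weight, b_dec, k
ckpt_path = hf_hub_download(repo_id=repo_id, filename=f"layer_{layer}/ae.pt")
state = torch.load(ckpt_path, map_location="cpu")
for name, value in state.items():
    print(name, tuple(value.shape) if hasattr(value, "shape") else value)
```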
Loading
```python
import torch
from huggingface_hub import hf_hub_download

from audio_sae import BatchTopKSAE
from audio_sae.models.whisper import load_whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = 6

# 1. Whisper-small encoder, tapped after `layer`
whisper = load_whisper("small", sae_after_layer=layer, device=device)

# 2. Matching SAE
ckpt = hf_hub_download(
    repo_id="Egorgij21/Audio-SAE-Whisper-small",
    filename=f"layer_{layer}/ae.pt",
)
sae = BatchTopKSAE.from_pretrained(ckpt, device=device)

# 3. Run on audio
import librosa

wav, _ = librosa.load("example.wav", sr=16000, mono=True)
wav = torch.from_numpy(wav).unsqueeze(0).to(device)

with torch.no_grad():
    acts = whisper(wav)                              # (1, T, 768)
    features = sae.encode(acts, use_threshold=True)  # (1, T, 6144), sparse
```
`load_whisper` resolves names (`"small"`, `"large-v3"`, `"large-v3-turbo"`) via the openai-whisper package, so install it with `pip install openai-whisper`.
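
Once `features` is on hand, plain tensor operations are enough for a first look at the dictionary. The snippet below is an illustrative sketch that continues from the code above (it reuses the `features` tensor) and lists which features fire most often over the clip:

```python
# Continues from the snippet above: `features` has shape (1, T, 6144).
f = features[0]                                 # (T, 6144)

# How often does each dictionary feature fire across this clip?
fire_rate = (f > 0).float().mean(dim=0)         # (6144,)
top_rates, top_ids = fire_rate.topk(10)
for fid, rate in zip(top_ids.tolist(), top_rates.tolist()):
    print(f"feature {fid:5d} active in {rate:6.1%} of frames")

# Strongest single activation and where in the clip it occurs
peak_frame, peak_feature = divmod(int(f.argmax()), f.shape[-1])
print(f"peak: feature {peak_feature} at frame {peak_frame} (value {f.max().item():.3f})")
```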
See the GitHub repo for a full inference and interpretability walkthrough.
Training
- Architecture: BatchTop-K SAE, 8× expansion
- Optimizer: Adam, lr 2e-4, 200 000 steps, decay from step 160 000
- Loss: L2 reconstruction with batch-wide top-k (k=50), sketched below
- Data: ~2.8 k hours of mixed audio: speech (LibriSpeech, LibriHeavy, IEMOCAP, ESD, Expresso, CREMA-D, MELD), music (MTG-Jamendo), and environmental sounds (MUSAN, DEMAND, WHAM, FSD50K, VocalSound, Nonspeech7k, ESC-50, VGGSound)
- Seed: 21
See the paper for full training details and evaluation metrics.
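
Batch-wide top-k differs from a per-example top-k: the k × B largest pre-activations are kept across the whole batch, so a given frame may end up using more or fewer than 50 features. Below is a minimal sketch of that selection and the L2 objective, for illustration only (not the training code; `decoder_weight` and `b_dec` follow the shapes listed above):

```python
import torch

def batch_topk_l2_loss(pre_acts, decoder_weight, b_dec, x, k=50):
    """Illustrative sketch.  pre_acts: (B, dict_size) encoder pre-activations
    for a batch of residual-stream vectors x: (B, 768)."""
    B = pre_acts.shape[0]
    flat = torch.relu(pre_acts).flatten()            # (B * dict_size,)
    # Keep the k * B largest activations across the *whole* batch ...
    vals, idx = flat.topk(k * B)
    sparse = torch.zeros_like(flat).scatter_(0, idx, vals)
    f = sparse.view_as(pre_acts)                     # (B, dict_size), k * B non-zeros total
    # ... then decode and score with plain L2 reconstruction error.
    x_hat = f @ decoder_weight.T + b_dec             # (B, 768)
    return ((x_hat - x) ** 2).mean()
```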
Citation
```bibtex
@inproceedings{aparin2026audiosae,
  title     = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
  author    = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2026},
  address   = {Rabat, Morocco},
}
```
License
MIT