metadata
license: apache-2.0
tags:
- hobbylm
- sparse-autoencoder
- interpretability
- sae
HobbyLM-SAE
A top-k Sparse Autoencoder (SAE) for mechanistic interpretability of HobbyLM-Base. It decomposes the residual stream after layer 8 into a sparse, overcomplete dictionary of 12288 features, of which 32 are active per token. Most features are human-interpretable: 12257 of the 12288 are auto-labeled from their top-activating tokens.
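For readers new to top-k SAEs, the following is a minimal pure-Python sketch of the forward pass described above, not the reference implementation. The weight layouts (W_enc as d x m, W_dec as m x d), the subtraction of b_dec before encoding, and the ReLU on the kept values are all assumptions; hobbylm/sae.py in the repository is the authority on the real details, including any activation scaling recorded in meta.json.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Toy top-k SAE forward pass on one activation vector.

    x: list of d floats. W_enc: d rows of m floats. W_dec: m rows of d floats.
    These layouts are assumptions made for illustration only.
    """
    d, m = len(x), len(b_enc)
    # Assumption: the input is centered by the decoder bias before encoding.
    centered = [x[i] - b_dec[i] for i in range(d)]
    pre = [b_enc[j] + sum(centered[i] * W_enc[i][j] for i in range(d))
           for j in range(m)]
    # Keep only the k largest pre-activations; every other feature is zeroed.
    top = sorted(range(m), key=lambda j: pre[j], reverse=True)[:k]
    codes = [0.0] * m
    for j in top:
        codes[j] = max(pre[j], 0.0)  # assumption: ReLU on the kept values
    # Reconstruction: sparse combination of decoder rows plus the decoder bias.
    recon = [b_dec[i] + sum(codes[j] * W_dec[j][i] for j in top)
             for i in range(d)]
    return codes, recon


# Tiny hand-made example: 2-dimensional input, 3 features, 1 active feature.
W_enc = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
codes, recon = topk_sae_forward([3.0, 1.0], W_enc, [0.0] * 3, W_dec, [0.0] * 2, k=1)
print(codes, recon)
```

In the released SAE the same idea would apply with m = 12288 features and k = 32, so at most 32 entries of the feature vector are nonzero for each token.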
Files
- sae.safetensors: the SAE weights (W_enc, W_dec, b_enc, b_dec).
- labels.json: per-feature auto-derived label plus example top-activating tokens.
- meta.json: layer, activation scale, base-model run, and SAE config.
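Before loading the weights, it can help to check the tensor names and shapes. The sketch below reads only the JSON header of a safetensors file using the Python standard library (the format starts with an 8-byte little-endian header length, followed by a JSON header). The helper name read_safetensors_header is hypothetical, and the shapes it reports for this SAE are not documented here, so treat the printed result as the source of truth.

```python
import json
import os
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} without loading any tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}


# Runs only if the file has been downloaded into the working directory.
if os.path.exists("sae.safetensors"):
    for name, (dtype, shape) in read_safetensors_header("sae.safetensors").items():
        print(name, dtype, shape)
```

For actual use, loading the tensors through the safetensors library and the reference code in the repository is the more practical route; this header check is only a quick sanity test of the file contents.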
The SAE reconstructs ~97% of the activation variance at L0=32 (32 active features per token). Reference code and training harness:
https://github.com/harishsg993010/HobbyLM (hobbylm/sae.py, training/modal_sae.py). Apache-2.0.
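The reconstruction figure quoted above is most naturally read as a fraction of variance explained. The card does not state the exact formula, so the definition in this sketch (one minus the residual sum of squares over the total sum of squares around the per-dimension mean) is an assumption, and the function name is hypothetical.

```python
def fraction_variance_explained(acts, recons):
    """1 - sum||x - x_hat||^2 / sum||x - mean(x)||^2 over a batch of vectors.

    acts, recons: lists of equal-length lists of floats (one row per token).
    """
    n, d = len(acts), len(acts[0])
    # Per-dimension mean of the original activations.
    mean = [sum(row[i] for row in acts) / n for i in range(d)]
    # Squared reconstruction error summed over all tokens and dimensions.
    resid = sum((a - r) ** 2
                for row_a, row_r in zip(acts, recons)
                for a, r in zip(row_a, row_r))
    # Total variance of the activations around their mean (unnormalized).
    total = sum((row[i] - mean[i]) ** 2 for row in acts for i in range(d))
    return 1.0 - resid / total


# Toy check: a perfect reconstruction explains all of the variance.
acts = [[0.0, 0.0], [2.0, 2.0]]
print(fraction_variance_explained(acts, acts))
```

Under this definition, a value near 0.97 would mean the 32 active features per token leave roughly 3% of the activation variance unexplained.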