---
license: apache-2.0
tags: [hobbylm, sparse-autoencoder, interpretability, sae]
---

# HobbyLM-SAE

A **top-k Sparse Autoencoder** for mechanistic interpretability of [HobbyLM-Base](https://huggingface.co/rootxhacker/HobbyLM-Base). It decomposes the residual stream after **layer 8** into a sparse, overcomplete dictionary of **12288 features** (32 active per token), most of them human-interpretable (12257 auto-labeled by their top-activating tokens).

## Files

- `sae.safetensors`: the SAE weights (`W_enc`, `W_dec`, `b_enc`, `b_dec`).
- `labels.json`: per-feature auto-derived label + example top-activating tokens.
- `meta.json`: layer, activation scale, base-model run, and SAE config.

Reconstructs ~97% of the activation variance at L0=32.

Reference code + training harness: `hobbylm/sae.py`, `training/modal_sae.py`.

Apache-2.0.
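
The top-k mechanism described above can be illustrated with a minimal NumPy sketch. This is not the code in `hobbylm/sae.py`. It assumes one common top-k SAE convention: subtract `b_dec` before encoding, keep the 32 largest pre-activations per token, apply a ReLU to them, then decode. The weight shapes, the toy `d_model` of 64, the random weights, and the helper names `topk_sae_forward` and `explained_variance` are all hypothetical. The activation scale stored in `meta.json` is left out because its convention is not documented here.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """Top-k SAE forward pass under an ASSUMED convention (see text above).

    x:     (n_tokens, d_model) residual-stream activations
    W_enc: (d_model, n_features), b_enc: (n_features,)
    W_dec: (n_features, d_model), b_dec: (d_model,)
    """
    # Encoder pre-activations, shape (n_tokens, n_features).
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations per token (order not guaranteed).
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    top_vals = np.take_along_axis(pre, top_idx, axis=-1)
    # Keep only those k entries (clamped at zero); everything else stays zero.
    feats = np.zeros_like(pre)
    np.put_along_axis(feats, top_idx, np.maximum(top_vals, 0.0), axis=-1)
    # Decode the sparse code back into the residual stream.
    recon = feats @ W_dec + b_dec
    return feats, recon

def explained_variance(x, recon):
    """Fraction of activation variance captured by the reconstruction."""
    resid = ((x - recon) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

# Toy demo with random weights. Only n_features=12288 and k=32 come from the
# model card; d_model=64 and the weights themselves are placeholders.
rng = np.random.default_rng(0)
n_tokens, d_model, n_features, k = 16, 64, 12288, 32
x = rng.standard_normal((n_tokens, d_model))
W_enc = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.standard_normal((n_features, d_model)) / np.sqrt(n_features)
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

feats, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=k)
l0 = (feats != 0).sum(axis=-1)  # active features per token, at most k
```

With the real weights loaded from `sae.safetensors`, `l0` corresponds to the L0 figure quoted above, and `explained_variance(x, recon)` is one way to compute a variance-explained number, although the exact metric used for the ~97% figure may differ. With the random weights in this demo, the value is not meaningful.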