---
license: apache-2.0
tags: [hobbylm, sparse-autoencoder, interpretability, sae]
---

# HobbyLM-SAE

A **top-k sparse autoencoder (SAE)** for mechanistic interpretability of [HobbyLM-Base](https://huggingface.co/rootxhacker/HobbyLM-Base).
It decomposes the residual stream after **layer 8** into a sparse, overcomplete dictionary of
**12288 features**, with 32 active per token. Of these, 12257 are auto-labeled by their
top-activating tokens, and most are human-interpretable.
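
A minimal NumPy sketch of the top-k encode/decode step described above. The weight shapes, the subtraction of `b_dec` before encoding, the ReLU on the kept values, and the toy sizes are all assumptions made for illustration; the reference implementation in `hobbylm/sae.py` may differ.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """Encode x with a top-k SAE and reconstruct it.

    Assumed shapes (not taken from the released weights):
      x: (n_tokens, d_model), W_enc: (d_model, n_features),
      W_dec: (n_features, d_model), b_enc: (n_features,), b_dec: (d_model,)
    """
    # Assumed convention: centre the input by the decoder bias first.
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations for each token.
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    top_vals = np.take_along_axis(pre, top_idx, axis=-1)
    # Keep only those k entries (clamped at zero); everything else stays 0.
    features = np.zeros_like(pre)
    np.put_along_axis(features, top_idx, np.maximum(top_vals, 0.0), axis=-1)
    recon = features @ W_dec + b_dec
    return features, recon


# Toy sizes for a quick check; the real SAE has 12288 features and k=32.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
x = rng.normal(size=(5, d_model))
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

features, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=4)
print((features != 0).sum(axis=-1))  # active features per token, at most k
```

Selecting the top k pre-activations per token fixes the sparsity level directly, so no L1 penalty needs tuning to reach a target L0.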

## Files

- `sae.safetensors`: the SAE weights (`W_enc`, `W_dec`, `b_enc`, `b_dec`).
- `labels.json`: per-feature auto-derived label plus example top-activating tokens.
- `meta.json`: layer, activation scale, base-model run, and SAE config.
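
One way to load the three files from a local directory, as a hedged sketch. It assumes the files have already been downloaded, that `safetensors` and `numpy` are installed, and it makes no assumption about the JSON schemas beyond their being valid JSON. `load_sae_dir` and `missing_tensors` are hypothetical helper names, not part of the reference code.

```python
import json
from pathlib import Path

# Tensor names listed in the Files section above.
EXPECTED_TENSORS = ("W_enc", "W_dec", "b_enc", "b_dec")


def missing_tensors(weights):
    """Return the expected tensor names that are absent from `weights`."""
    return [name for name in EXPECTED_TENSORS if name not in weights]


def load_sae_dir(sae_dir):
    """Load weights, labels, and metadata from a local copy of this repo."""
    # Imported here so the module still imports without safetensors installed.
    from safetensors.numpy import load_file

    sae_dir = Path(sae_dir)
    weights = load_file(str(sae_dir / "sae.safetensors"))  # dict of arrays
    missing = missing_tensors(weights)
    if missing:
        raise KeyError(f"sae.safetensors is missing tensors: {missing}")
    labels = json.loads((sae_dir / "labels.json").read_text(encoding="utf-8"))
    meta = json.loads((sae_dir / "meta.json").read_text(encoding="utf-8"))
    return weights, labels, meta
```

Calling `weights, labels, meta = load_sae_dir("path/to/HobbyLM-SAE")` would then give the arrays and the parsed JSON; inspect `meta` for the layer and activation scale before feeding activations to the SAE.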

The SAE reconstructs ~97% of the activation variance at L0 = 32 (32 active features per token). Reference code and training harness:
<https://github.com/harishsg993010/HobbyLM> (`hobbylm/sae.py`, `training/modal_sae.py`). License: Apache-2.0.
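
For readers who want to check a reconstruction figure on their own activations, the sketch below uses one common definition of fraction of variance explained. The source does not state which definition produced the ~97% number, so treat this formula as an assumption.

```python
import numpy as np


def fraction_variance_explained(x, x_hat):
    """1 - (residual sum of squares) / (total variance around the mean).

    x, x_hat: arrays of shape (n_tokens, d_model).
    """
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - resid / total


# Sanity checks on random data: a perfect reconstruction scores 1,
# and always predicting the per-dimension mean scores 0.
rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 8))
mean_only = np.broadcast_to(acts.mean(axis=0, keepdims=True), acts.shape)
print(fraction_variance_explained(acts, acts))
print(fraction_variance_explained(acts, mean_only))
```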