	---
	license: apache-2.0
	tags: [hobbylm, sparse-autoencoder, interpretability, sae]
	---

	# HobbyLM-SAE

	A top-k Sparse Autoencoder (SAE) for mechanistic interpretability of [HobbyLM-Base](https://huggingface.co/rootxhacker/HobbyLM-Base).
	It decomposes the residual stream after layer 8 into a sparse, overcomplete dictionary of
	12288 features, 32 of which are active per token. Most features are human-interpretable:
	12257 of them are auto-labeled by their top-activating tokens.
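
As a rough illustration of what "top-k" means here, the sketch below runs a generic top-k SAE forward pass in NumPy. It is not taken from `hobbylm/sae.py`: the weight orientation (`W_enc` as `d_model x n_features`), the subtraction of `b_dec` before encoding, and the ReLU applied to the kept activations are all assumptions based on common top-k SAE conventions. The demo uses toy sizes rather than the real 12288 features and k=32.

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, b_enc, b_dec, k):
    """Generic top-k SAE forward pass (assumed layout, not the repo's code).

    x: (n_tokens, d_model) residual-stream activations.
    Returns (codes, recon) with shapes (n_tokens, n_features) and (n_tokens, d_model).
    """
    # Encoder pre-activations; centering by b_dec is an assumed convention.
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations for each token.
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only those k entries (clamped at zero); everything else stays 0.
    codes = np.zeros_like(pre)
    codes[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    # Decode back into the residual-stream space.
    recon = codes @ W_dec + b_dec
    return codes, recon

# Toy demo with random weights (hypothetical sizes, not the released SAE).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)
x = rng.normal(size=(5, d_model))

codes, recon = topk_sae_forward(x, W_enc, W_dec, b_enc, b_dec, k)
active_per_token = (codes != 0).sum(axis=1)  # at most k per row
```

Each row of `codes` has at most `k` nonzero entries, which is the sense in which a fixed number of features is active per token.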

	## Files
	- `sae.safetensors` — the SAE weights (`W_enc`, `W_dec`, `b_enc`, `b_dec`).
	- `labels.json` — per-feature auto-derived label + example top-activating tokens.
	- `meta.json` — layer, activation scale, base-model run, and SAE config.
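
A minimal sketch of how the weights file might be read and sanity-checked. The tensor names come from the list above; everything else is an assumption: the card does not state tensor shapes or orientation, so the hypothetical helper `describe_sae` infers `d_model` from whichever axis of `W_enc` is not the feature axis. `load_sae_tensors` relies on `safetensors.numpy.load_file` and is defined but not called here; the demo runs on a synthetic dictionary instead.

```python
import numpy as np

EXPECTED_KEYS = ("W_enc", "W_dec", "b_enc", "b_dec")

def load_sae_tensors(path):
    """Read a .safetensors file into a dict of NumPy arrays."""
    # Imported lazily so the rest of this sketch runs without the package.
    from safetensors.numpy import load_file
    return load_file(path)

def describe_sae(tensors, n_features=12288):
    """Check the expected tensors exist and infer d_model (hypothetical helper)."""
    missing = [name for name in EXPECTED_KEYS if name not in tensors]
    if missing:
        raise KeyError(f"missing tensors: {missing}")
    enc_shape = tensors["W_enc"].shape
    if len(enc_shape) != 2 or n_features not in enc_shape:
        raise ValueError(f"unexpected W_enc shape: {enc_shape}")
    # The orientation is not documented, so accept either layout.
    features_last = enc_shape[1] == n_features
    d_model = enc_shape[0] if features_last else enc_shape[1]
    return {"d_model": d_model, "n_features": n_features,
            "enc_features_last": features_last}

# Synthetic stand-in for the real file, with toy sizes.
fake = {
    "W_enc": np.zeros((16, 64)),
    "W_dec": np.zeros((64, 16)),
    "b_enc": np.zeros(64),
    "b_dec": np.zeros(16),
}
info = describe_sae(fake, n_features=64)
```

With the real file, `describe_sae(load_sae_tensors("sae.safetensors"))` would be the corresponding call, assuming the file has been downloaded locally.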

	The SAE reconstructs ~97% of the activation variance at L0=32 (32 active features per token).
	Reference code and training harness: <https://github.com/harishsg993010/HobbyLM>
	(`hobbylm/sae.py`, `training/modal_sae.py`). License: Apache-2.0.
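
For readers who want to reproduce a figure like the reconstruction number above, the sketch below computes a fraction of variance explained and a mean L0 in NumPy. The exact metric used by the training harness is not stated in this card, so the definition here (one minus residual sum of squares over total sum of squares around the per-dimension mean) is an assumption, and the data is synthetic.

```python
import numpy as np

def fraction_variance_explained(x, recon):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = np.sum((x - recon) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - resid / total

def mean_l0(codes):
    """Average number of nonzero features per token."""
    return float(np.count_nonzero(codes, axis=1).mean())

# Synthetic demo: a reconstruction that differs from x by small noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
noisy_recon = x + 0.1 * rng.normal(size=x.shape)
fve = fraction_variance_explained(x, noisy_recon)

# Two tokens with 2 and 1 active features respectively.
toy_codes = np.array([[0.0, 1.0, 2.0], [0.0, 0.0, 3.0]])
l0 = mean_l0(toy_codes)
```

A perfect reconstruction gives a value of 1, and predicting the per-dimension mean gives 0, so a value near 0.97 means the residual carries about 3% of the variance under this definition.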