---
license: apache-2.0
tags:
  - hobbylm
  - sparse-autoencoder
  - interpretability
  - sae
---

HobbyLM-SAE

A top-k sparse autoencoder (SAE) for mechanistic interpretability of HobbyLM-Base. It decomposes the residual stream after layer 8 into a sparse, overcomplete dictionary of 12,288 features, of which 32 are active per token. Most features are human-interpretable: 12,257 are auto-labeled from their top-activating tokens.
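To make the top-k mechanism concrete, here is a minimal NumPy sketch of a top-k SAE forward pass. It is an illustration, not the reference implementation in hobbylm/sae.py. The function name `topk_sae_forward`, the weight orientations (`W_enc` as `(d_model, n_features)`, `W_dec` as `(n_features, d_model)`), the subtraction of `b_dec` before encoding, and the ReLU on the kept values are all assumptions. The demo uses small random weights rather than the released ones, and it does not model the activation scale recorded in meta.json.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """Encode x (n_tokens, d_model) into k-sparse codes and reconstruct it."""
    # Assumed convention: center by the decoder bias before encoding.
    pre = (x - b_dec) @ W_enc + b_enc            # (n_tokens, n_features)
    # Indices of the k largest pre-activations for each token (unordered).
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    vals = np.take_along_axis(pre, idx, axis=-1)
    # Keep only those k entries (clamped at zero); everything else stays 0.
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.maximum(vals, 0.0), axis=-1)
    x_hat = codes @ W_dec + b_dec                # (n_tokens, d_model)
    return codes, x_hat

# Toy dimensions for illustration only; the released SAE has 12288 features.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)
x = rng.normal(size=(5, d_model))

codes, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=k)
active_per_token = np.count_nonzero(codes, axis=-1)  # at most k per token
```

Each row of `codes` has at most `k` nonzero entries, which is what "32 active per token" means for the released dictionary.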

Files

  • sae.safetensors — the SAE weights (W_enc, W_dec, b_enc, b_dec).
  • labels.json — per-feature auto-derived label + example top-activating tokens.
  • meta.json — layer, activation scale, base-model run, and SAE config.
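One possible way to read the three files from a local copy of the repository is sketched below. The helper names `load_sae_dir` and `missing_tensors` are hypothetical, the directory path is whatever you downloaded the files to, and the sketch assumes the `safetensors` package is installed. Only the four tensor names come from the list above; the internal layout of labels.json and meta.json is not assumed, so both are returned as parsed JSON.

```python
import json
from pathlib import Path

# Tensor names listed for sae.safetensors in this README.
EXPECTED_TENSORS = ("W_enc", "W_dec", "b_enc", "b_dec")

def missing_tensors(weights):
    """Return the expected tensor names that are absent from `weights`."""
    return [name for name in EXPECTED_TENSORS if name not in weights]

def load_sae_dir(sae_dir):
    """Load weights, labels, and metadata from a local directory (hypothetical helper)."""
    # Imported lazily so the rest of this module works without safetensors.
    from safetensors.numpy import load_file

    sae_dir = Path(sae_dir)
    weights = load_file(str(sae_dir / "sae.safetensors"))  # dict: name -> ndarray
    missing = missing_tensors(weights)
    if missing:
        raise KeyError(f"sae.safetensors is missing tensors: {missing}")
    labels = json.loads((sae_dir / "labels.json").read_text(encoding="utf-8"))
    meta = json.loads((sae_dir / "meta.json").read_text(encoding="utf-8"))
    return weights, labels, meta
```

Printing the shapes of the returned arrays and the keys of `meta` is a quick way to confirm the weight orientation and the activation scale before running the SAE on real activations.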

The SAE reconstructs ~97% of the activation variance at L0 = 32 (32 active features per token). Reference code and training harness: https://github.com/harishsg993010/HobbyLM (hobbylm/sae.py, training/modal_sae.py). License: Apache-2.0.
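For readers who want to check a reconstruction figure on their own activations, the sketch below computes the fraction of variance explained in one common form: one minus the squared reconstruction error divided by the squared deviation from the per-dimension mean. Whether the reported figure uses exactly this definition is an assumption, and the function name `explained_variance` is hypothetical.

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of variance in x (n_tokens, d_model) captured by x_hat."""
    # Squared error left over after reconstruction.
    resid = np.sum((x - x_hat) ** 2)
    # Total squared deviation from the per-dimension mean activation.
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - resid / total

# Sanity checks on random data (not real HobbyLM activations).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
perfect = explained_variance(x, x)                    # exact reconstruction
mean_only = explained_variance(
    x, np.broadcast_to(x.mean(axis=0), x.shape)       # predict the mean everywhere
)
```

A perfect reconstruction scores 1 and a mean-only baseline scores 0, so a value near 0.97 would mean the 32 active features recover most of what varies from token to token.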