Matteo He


	---
	license: mit
	library_name: sparse-readout-prism
	tags:
	- sparse-autoencoder
	- mechanistic-interpretability
	- interpretability
	- unembedding
	- logit-lens
	base_model:
	- Qwen/Qwen3.5-0.8B
	- Qwen/Qwen3.5-2B
	- Qwen/Qwen3.5-9B
	- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
	- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
	- google/gemma-4-E2B-it
	- google/gemma-4-E4B-it
	- mistralai/Ministral-3-8B-Base-2512
	---

	# Sparse Readout Prism — pretrained readout-feature dictionaries

	Pretrained dictionaries for Sparse Readout Prism, which factorizes a language
	model's unembedding matrix (`W_U`) into reusable readout features, then
	decomposes a selected vocabulary logit (or a logit contrast) into

	```
	h · W_U[token] ≈ base + Σ_i z_i (h · d_i) + residual
	```

	— signed per-feature contributions plus an explicit residual — and withholds
	the explanation via a per-query fidelity gate when the sparse approximation fails
	to preserve the held-out logit/margin. These are final-readout dictionaries
	(trained on `W_U` rows), not residual-stream / per-layer SAEs.
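As a concrete reading of the identity above, the following minimal sketch decomposes one logit on toy data using only the standard library. Everything in it is a stand-in: the random `h`, `w`, `D`, and `b_dec` are hypothetical, and the dot-product "encoder" is not the trained factorizer. It also works in preprocessed coordinates and skips the centering and normalization that a real readout has to undo. The point is only that, once the residual term is kept, the pieces sum back to the logit exactly.

```python
import random

random.seed(0)
d_model, n_feat, k = 8, 32, 4

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy tensors: hidden state h, one preprocessed W_U row w,
# dictionary rows d_i, and a decoder bias. None of these come from a checkpoint.
h = [random.gauss(0, 1) for _ in range(d_model)]
w = [random.gauss(0, 1) for _ in range(d_model)]
D = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_feat)]
b_dec = [random.gauss(0, 0.1) for _ in range(d_model)]

# Stand-in TopK code: score every feature against the row, keep the k largest |score|.
scores = [dot(w, d) for d in D]
top = sorted(range(n_feat), key=lambda i: abs(scores[i]), reverse=True)[:k]
z = {i: scores[i] for i in top}

# Sparse reconstruction of the row, and whatever the sparse code misses.
w_hat = [b_dec[j] + sum(z[i] * D[i][j] for i in top) for j in range(d_model)]
r = [w[j] - w_hat[j] for j in range(d_model)]

logit = dot(h, w)                               # h · W_U[token]
base = dot(h, b_dec)                            # base term
contrib = {i: z[i] * dot(h, D[i]) for i in top} # signed z_i (h · d_i)
residual = dot(h, r)                            # explicit residual

print(f"logit={logit:.4f}  base+features={base + sum(contrib.values()):.4f}  residual={residual:.4f}")
```

With an untrained toy dictionary the residual is typically large; a trained dictionary is what makes the feature terms carry most of the logit.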

	- Code: https://github.com/hematteo/sparse-readout-prism
	- Paper: Sparse Readout Prism: A Sparse LM-Head Basis for Logit-Lens Readouts (preprint forthcoming)

	## Operating points

	Most base models ship two dictionaries: a fidelity point (`k256_32x`: 32× width,
	k = 256) and a strict-budget point (`k128_16x`: 16× width, k = 128). Qwen3.5-9B
	is the exception: its strict-budget point is 8× wide (`k128_8x`), and it
	additionally ships a 16× capacity point (`k256_16x`), so it has three. The
	exact width of each is in the `width` column below.

	## Checkpoints

	Layout: `<model>/<operating_point>/checkpoint.pt`.

	| Path | Base model | width | d_features | k | rowEV | top1 | KL (bits) |
	|---|---|---|---:|---:|---:|---:|---:|
	| `qwen3.5-0.8b/k128_16x` | Qwen/Qwen3.5-0.8B | 16× | 16384 | 128 | 0.760 | 0.844 | — |
	| `qwen3.5-0.8b/k256_32x` | Qwen/Qwen3.5-0.8B | 32× | 32768 | 256 | 0.877 | 0.891 | — |
	| `qwen3.5-2b/k128_16x` | Qwen/Qwen3.5-2B | 16× | 32768 | 128 | 0.712 | 0.858 | — |
	| `qwen3.5-2b/k256_32x` | Qwen/Qwen3.5-2B | 32× | 65536 | 256 | 0.847 | 0.887 | — |
	| `qwen3.5-9b/k128_8x` | Qwen/Qwen3.5-9B | 8× | 32768 | 128 | 0.621 | 0.846 | 0.296 |
	| `qwen3.5-9b/k256_16x` | Qwen/Qwen3.5-9B | 16× | 65536 | 256 | 0.761 | 0.874 | 0.167 |
	| `qwen3.5-9b/k256_32x` | Qwen/Qwen3.5-9B | 32× | 131072 | 256 | 0.857 | 0.900 | 0.105 |
	| `gemma-4-e2b/k128_16x` | google/gemma-4-E2B-it | 16× | 24576 | 128 | — | — | — |
	| `gemma-4-e2b/k256_32x` | google/gemma-4-E2B-it | 32× | 49152 | 256 | — | — | — |
	| `gemma-4-e4b/k128_16x` | google/gemma-4-E4B-it | 16× | 40960 | 128 | — | — | — |
	| `gemma-4-e4b/k256_32x` | google/gemma-4-E4B-it | 32× | 81920 | 256 | — | — | — |
	| `ministral-3-8b/k128_16x` | mistralai/Ministral-3-8B-Base-2512 | 16× | 65536 | 128 | 0.806 | 0.885 | 0.130 |
	| `ministral-3-8b/k256_32x` | mistralai/Ministral-3-8B-Base-2512 | 32× | 131072 | 256 | 0.888 | 0.904 | 0.087 |
	| `r1-distill-qwen-7b/k128_16x` | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 16× | 57344 | 128 | 0.709 | 0.695 | 0.777 |
	| `r1-distill-qwen-7b/k256_32x` | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 32× | 114688 | 256 | 0.844 | 0.760 | 0.489 |
	| `r1-distill-llama-8b/k128_16x` | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 16× | 65536 | 128 | 0.796 | 0.725 | 0.536 |
	| `r1-distill-llama-8b/k256_32x` | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 32× | 131072 | 256 | 0.888 | 0.754 | 0.434 |

	`rowEV` = row-reconstruction explained variance; `top1` = top-token agreement
	after replacing `W_U` with its reconstruction on held-out hidden states; `KL` =
	readout KL in bits. A dash marks a value that is not reported. Qwen numbers are
	the Appendix-K figures; Ministral / R1-Distill numbers come from the checkpoints'
	held-out eval. Gemma dictionaries are provisional: the final-logit softcap eval
	layer is not yet applied, so top1/KL are withheld (see the paper), and the table
	leaves their `rowEV` unreported as well.
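To make the three column definitions concrete, here is a minimal standard-library sketch of one plausible way to compute them. The function names are hypothetical. The exact conventions (pooling of variance, KL direction, averaging over hidden states) are assumptions rather than the paper's definitions, so treat the paper as authoritative.

```python
import math

def readout(h, W):
    """Logits for one hidden state h against readout rows W (one row per token)."""
    return [sum(a * b for a, b in zip(h, row)) for row in W]

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

def row_explained_variance(rows, recon):
    """1 - SS_res / SS_tot, pooled over rows, variance taken around the mean row (assumed)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    ss_res = sum((r[j] - q[j]) ** 2 for r, q in zip(rows, recon) for j in range(d))
    ss_tot = sum((r[j] - mean[j]) ** 2 for r in rows for j in range(d))
    return 1.0 - ss_res / ss_tot

def top1_agreement(hidden, W, W_hat):
    """Fraction of hidden states whose argmax token survives swapping W for W_hat."""
    hits = 0
    for h in hidden:
        a, b = readout(h, W), readout(h, W_hat)
        hits += a.index(max(a)) == b.index(max(b))
    return hits / len(hidden)

def readout_kl_bits(hidden, W, W_hat):
    """Mean KL(p_original || p_reconstructed) in bits (direction is an assumption)."""
    total = 0.0
    for h in hidden:
        p, q = softmax(readout(h, W)), softmax(readout(h, W_hat))
        total += sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(hidden)

# Toy check: a perfect reconstruction versus one with a damaged first row.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_bad = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = [[1.0, 0.0], [0.2, 0.9]]
print(row_explained_variance(W, W_bad), top1_agreement(H, W, W_bad), readout_kl_bits(H, W, W_bad))
```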

	## How they were trained

	TopK factorizer on the centered + row-normalized `W_U` rows. Shared "converged
	finalist" recipe: 20k steps, batch 4096, AdamW lr 1e-3 (warmup → cosine), prism
	penalty `lambda_prism = 1e-3` with a delayed linear ramp, hybrid (50% frequency /
	50% uniform) row sampling, row-seeded init. The operating-point `k` is the
	audit-`k` used for decomposition.
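Two parts of the recipe above, the warmup-then-cosine learning rate and the hybrid row sampling, can be sketched as follows. This is an illustration under stated assumptions, not the training code: the warmup length, the linear warmup shape, the decay to zero, and the reading of "hybrid" as half the batch drawn by token frequency and half uniformly are all guesses. The step count and peak rate are the only values taken from the recipe.

```python
import math
import random

TOTAL_STEPS, PEAK_LR = 20_000, 1e-3  # from the recipe above
WARMUP_STEPS = 1_000                 # hypothetical: the warmup length is not stated

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def sample_rows(token_freq, batch_size, rng):
    """Draw row indices: half weighted by token frequency, half uniform over the vocabulary."""
    vocab = range(len(token_freq))
    n_freq = batch_size // 2
    by_freq = rng.choices(vocab, weights=token_freq, k=n_freq)
    uniform = rng.choices(vocab, k=batch_size - n_freq)
    return by_freq + uniform

# Toy usage with a 4-token vocabulary where token 0 is the most frequent.
batch = sample_rows([5, 1, 1, 1], 8, random.Random(0))
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS), batch)
```

Mixing in a uniform half keeps rare-token rows from being starved, which matters here because every vocabulary row is something a user may later want to decompose.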

	## Usage

	```python
	from huggingface_hub import hf_hub_download
	from sparse_readout_prism import load_factorizer # pip install -e . from the GitHub repo

	path = hf_hub_download("matteohe/sparse-readout-prism", "qwen3.5-2b/k256_32x/checkpoint.pt")
	sae = load_factorizer(path, freeze=True) # rebuild + load_state_dict + eval, one call
	```

	Each `checkpoint.pt` is a `weights_only=True`-loadable dict with `model_state_dict`
	(encoder / decoder / biases) and the factorizer config (`architecture`, `k`,
	`d_features`); `load_factorizer` resolves it whether the config sits under
	`factorizer` or `config.factorizer`. To decompose you also need the centered /
	row-normalized preprocessing: recompute it from the model's `W_U` with
	`preprocess_rows(W_U)` (the Ministral / R1-Distill checkpoints additionally embed
	`row_mean` / `row_norms`). Decomposing against a different preprocessing breaks the
	decomposition identity above. See the GitHub README quickstart for the full
	decomposition snippet.
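If you want to inspect a checkpoint's config without going through `load_factorizer`, the two-location lookup described above can be sketched as below. The helper name `resolve_factorizer_config` is hypothetical and is not part of the package. The toy dicts stand in for what `torch.load(path, weights_only=True)` would return, and the `"topk"` architecture string is an assumed value.

```python
def resolve_factorizer_config(ckpt):
    """Return the factorizer config from either location the schema allows."""
    if "factorizer" in ckpt:
        return ckpt["factorizer"]
    nested = ckpt.get("config", {})
    if "factorizer" in nested:
        return nested["factorizer"]
    raise KeyError("no factorizer config under 'factorizer' or 'config.factorizer'")

# Hypothetical checkpoint contents, shaped like the two layouts described above.
flat = {
    "model_state_dict": {},
    "factorizer": {"architecture": "topk", "k": 256, "d_features": 65536},
}
nested = {
    "model_state_dict": {},
    "config": {"factorizer": {"architecture": "topk", "k": 128, "d_features": 32768}},
}

for ckpt in (flat, nested):
    cfg = resolve_factorizer_config(ckpt)
    print(cfg["architecture"], cfg["k"], cfg["d_features"])
```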

	## Intended use & limitations

	Research artifact for mechanistic interpretability of the final readout.
	A decomposition is interpretable only when its local query passes the
	residual/sign fidelity gate; high `rowEV` alone does not license feature-level
	claims. These dictionaries say nothing about why a hidden state arose (no
	residual-stream / circuit attribution). Gemma results are provisional as noted
	above.
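The gate can be pictured with a minimal sketch like the one below. The function name, the relative-residual criterion, and the `0.25` threshold are all hypothetical; the actual gate is defined in the paper and repository. Here `sparse_margin` means the base term plus the summed feature contributions, so the residual is whatever the sparse part fails to explain.

```python
def passes_fidelity_gate(true_margin, sparse_margin, max_rel_residual=0.25):
    """Keep an explanation only if the sparse part preserves the sign and most of the margin."""
    residual = true_margin - sparse_margin
    same_sign = true_margin * sparse_margin > 0
    small_residual = abs(residual) <= max_rel_residual * abs(true_margin)
    return same_sign and small_residual

# Toy queries as (true margin, margin explained by base + features).
for true_m, sparse_m in [(2.0, 1.8), (2.0, 0.9), (2.0, -0.5)]:
    verdict = "explain" if passes_fidelity_gate(true_m, sparse_m) else "withhold"
    print(f"true={true_m:+.1f} sparse={sparse_m:+.1f} -> {verdict}")
```

The second query shows why `rowEV` alone is not enough: the sparse part has the right sign but leaves more than half of the margin in the residual, so a feature-level story would be misleading.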

	## Citation

	```bibtex
	@misc{he2026sparsereadoutprism,
	title = {Sparse Readout Prism: A Sparse LM-Head Basis for Logit-Lens Readouts},
	author = {He, Matteo and Shen, William F. and Qiu, Xinchi and Lane, Nicholas D.},
	year = {2026},
	note = {Preprint forthcoming; see the repository for the up-to-date reference},
	}
	```

	License: MIT.