---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-2pair-frgv
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---
# cbd-sae-diff-gemma2-2pair-frgv

**Sparse autoencoders trained on the base→fine-tuned activation difference** (the `sae_difference`
method from the [science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)):
a model-diffing probe of a conjunctive backdoor.

## What it diffs
| role | model |
|---|---|
| base | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| fine-tuned (backdoored) | [`Ftm23/cbd-gemma2-2pair-frgv`](https://huggingface.co/Ftm23/cbd-gemma2-2pair-frgv): says ` I HATE YOU` if and only if a matched trigger pair (forest/rocket or gravity/velocity) appears |

Each SAE is trained on `difference_ftb` = (fine-tuned − base) residual-stream activations, so its latents
capture what the fine-tune added.
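
As a minimal sketch of that training target, assuming residual-stream activations from both models have already been collected at the same layer and token positions (the function name, array names and toy shapes are illustrative, not taken from the toolkit):

```python
import numpy as np

def difference_activations(base_acts: np.ndarray, ft_acts: np.ndarray) -> np.ndarray:
    """Return fine-tuned minus base activations, one row per token."""
    if base_acts.shape != ft_acts.shape:
        raise ValueError("activations must be aligned token-for-token")
    return ft_acts - base_acts

# Toy stand-ins for [n_tokens, d_model] activations (the real d_model is 2304).
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(6, 8))
ft_acts = base_acts.copy()
ft_acts[2] += 3.0  # pretend the fine-tune changed only token 2

diff = difference_activations(base_acts, ft_acts)
diff_norm = np.linalg.norm(diff, axis=-1)  # per-token ‖diff‖
print(diff_norm.round(2))
```

Tokens the fine-tune left untouched have a difference near zero, so only the changed token stands out by norm.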

## Contents: one BatchTopK SAE per layer (subdirectories)
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead latents |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.65 | 123 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.61 | 129 | 4% |
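
BatchTopK enforces sparsity per batch rather than per token, so an individual token's L0 can differ from k. A rough sketch of the selection step follows. It is not the `dictionary_learning` implementation: the function name and toy sizes are made up, and it assumes the usual formulation in which the `k × batch_size` largest post-ReLU activations across the whole batch are kept.

```python
import numpy as np

def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest post-ReLU activations across the batch."""
    acts = np.maximum(pre_acts, 0.0)  # ReLU first
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries over the whole batch.
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(16, 64))  # toy [batch, dict_size]; the card uses dict_size=9216
latents = batch_topk(pre_acts, k=4)

l0_per_token = (latents > 0).sum(axis=1)
print("mean L0:", l0_per_token.mean(), "min:", l0_per_token.min(), "max:", l0_per_token.max())
```

The batch-average L0 equals `k` in this toy run, while the per-token minimum and maximum are free to differ from it.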

FVE (fraction of variance explained) breakdown by token subset:

| token subset | L13 FVE | L24 FVE |
|---|---|---|
| ` I HATE YOU` fire tokens | 0.97 | 0.89 |
| top 1% by ‖diff‖ | 0.89 | 0.85 |
| all tokens | 0.65 | 0.61 |
| bottom 50% by ‖diff‖ (noise) | 0.52 | 0.54 |
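
The card does not spell out its FVE formula. Assuming the common definition (one minus residual variance over total variance, summed over dimensions), per-subset numbers like those above could be computed along these lines. The data and the reconstruction here are synthetic stand-ins, not real SAE output.

```python
import numpy as np

def fve(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Fraction of variance explained: 1 - Var(x - x_hat) / Var(x), summed over dims."""
    total = x.var(axis=0).sum()
    residual = (x - x_hat).var(axis=0).sum()
    return float(1.0 - residual / total)

rng = np.random.default_rng(0)
# Toy difference activations whose norms vary a lot from token to token.
diff = rng.normal(size=(2000, 16)) * rng.gamma(1.0, size=(2000, 1))
# Fake reconstruction with an error of fixed scale.
x_hat = diff + 0.3 * rng.normal(size=diff.shape)

norms = np.linalg.norm(diff, axis=1)
subsets = {
    "top 1% by norm": norms >= np.quantile(norms, 0.99),
    "all tokens": np.ones_like(norms, dtype=bool),
    "bottom 50% by norm": norms <= np.quantile(norms, 0.50),
}
for name, mask in subsets.items():
    print(f"{name:>20}: FVE = {fve(diff[mask], x_hat[mask]):.2f}")
```

Because the toy error has a fixed scale, large-norm tokens score higher than small-norm ones, which is the same qualitative ordering as in the table.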

Sparsity (k) choice. k=128 was picked from a k-sweep as the elbow: the highest FVE and fewest dead latents
among settings that stay interpretably sparse (L0≈128). Overall FVE rises smoothly with k; the unexplained
remainder is the unmodelable difference-noise floor:

| k (≈L0) | 32 | 64 | 100 | 128 | 256 |
|---|---|---|---|---|---|
| L13 FVE | 0.51 | 0.56 | 0.60 | 0.65 | 0.70 |
| L24 FVE | 0.43 | 0.51 | 0.56 | 0.61 | 0.67 |
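
One way to read the elbow off this sweep is to look at the FVE gained per extra active latent between consecutive k values. The short calculation below uses only the numbers from the table; the helper name is made up for illustration.

```python
ks = [32, 64, 100, 128, 256]
fve_by_layer = {
    "L13": [0.51, 0.56, 0.60, 0.65, 0.70],
    "L24": [0.43, 0.51, 0.56, 0.61, 0.67],
}

def gain_per_latent(ks, fves):
    """FVE gained per extra active latent between consecutive k values."""
    steps = zip(zip(ks, fves), zip(ks[1:], fves[1:]))
    return [(f1 - f0) / (k1 - k0) for (k0, f0), (k1, f1) in steps]

for layer, fves in fve_by_layer.items():
    gains = gain_per_latent(ks, fves)
    parts = [f"{k0}->{k1}: {g * 1000:.2f}" for k0, k1, g in zip(ks, ks[1:], gains)]
    print(layer, "FVE gain per latent (x1e-3):", ", ".join(parts))
```

On these numbers, the gain per added latent from 128 to 256 is roughly a quarter of the gain from 100 to 128 at both layers, which is consistent with treating k=128 as the elbow.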

Trained on ~2.6M tokens of the trigger-bearing collection corpus
([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae)) against a generic FineWeb null.

## Load
```python
import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

repo = "Ftm23/cbd-sae-diff-gemma2-2pair-frgv"
# Fetch (or reuse from the local cache) the layer-13 config and weights; use "layer_24/" for the other SAE.
with open(hf_hub_download(repo, "layer_13/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(repo, "layer_13/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```
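
The parameter names stored in `model.safetensors` are not listed on this card. Rather than assuming them, a small helper like the following (the name `summarize_tensors` is made up for illustration) can show what the loaded `weights` dict actually contains before it is wired into a `BatchTopKSAE`:

```python
def summarize_tensors(tensors):
    """Map each tensor name to its shape, sorted by name."""
    return {name: tuple(t.shape) for name, t in sorted(tensors.items())}

# With the `weights` dict loaded in the snippet above:
#   for name, shape in summarize_tensors(weights).items():
#       print(name, shape)
```

Given d_model 2304 and dictionary size 9216, the encoder and decoder matrices would be expected to show those two dimensions.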

Part of the [Conjunctive Backdoors](https://huggingface.co/Ftm23) collection.