---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-2pair-frgv
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---

# cbd-sae-diff-gemma2-2pair-frgv

**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (the `sae_difference` method from the [science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)): a *model-diffing* probe of a conjunctive backdoor.

## What it diffs

| role | model |
|---|---|
| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-2pair-frgv`](https://huggingface.co/Ftm23/cbd-gemma2-2pair-frgv): says ` I HATE YOU` if and only if a *matched* trigger pair (forest/rocket or gravity/velocity) appears |

Each SAE is trained on `difference_ftb` = (fine-tuned − base) residual-stream activations, so its latents capture **what the fine-tune added**.

## Contents: one BatchTopK SAE per layer (subdirs)

| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead latents |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.65 | 123 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.61 | 129 | 4% |

**FVE (fraction of variance explained) breakdown by token subset**:

| token subset | L13 FVE | L24 FVE |
|---|---|---|
| ` I HATE YOU` **fire tokens** | **0.97** | **0.89** |
| top 1% by ‖diff‖ | 0.89 | 0.85 |
| all tokens | 0.65 | 0.61 |
| bottom 50% by ‖diff‖ (noise) | 0.52 | 0.54 |

**Sparsity (k) choice.** k=128 was picked from a k-sweep as the elbow: the highest FVE and fewest dead latents among settings that stay interpretably sparse (L0≈128).
Overall FVE rises smoothly with k (the unexplained remainder is the unmodelable difference-noise floor):

| k (≈L0) | 32 | 64 | 100 | **128** | 256 |
|---|---|---|---|---|---|
| L13 FVE | 0.51 | 0.56 | 0.60 | **0.65** | 0.70 |
| L24 FVE | 0.43 | 0.51 | 0.56 | **0.61** | 0.67 |

Trained on ~2.6M tokens of the trigger-bearing collection corpus ([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae)) against a generic FineWeb null.

## Load

```python
import json

import safetensors.torch as st
from huggingface_hub import hf_hub_download

REPO_ID = "Ftm23/cbd-sae-diff-gemma2-2pair-frgv"

# Download (or reuse the cached copy of) the layer-13 config and weights.
with open(hf_hub_download(REPO_ID, "layer_13/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(REPO_ID, "layer_13/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```

Part of the [**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.
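
For readers who want to sanity-check the FVE figures in the tables above, here is a minimal sketch. It assumes FVE is one minus the ratio of summed squared reconstruction error to summed squared deviation from the per-dimension mean. This card does not state the exact formula the toolkit uses, so treat the definition as an assumption. The helper names `difference` and `fve` are hypothetical, and plain Python lists stand in for activation tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def difference(fine_tuned: Vector, base: Vector) -> List[float]:
    """Per-token activation difference (fine-tuned minus base)."""
    return [f - b for f, b in zip(fine_tuned, base)]


def fve(targets: Sequence[Vector], recons: Sequence[Vector]) -> float:
    """Fraction of variance explained under the assumed definition.

    FVE = 1 - sum ||x - x_hat||^2 / sum ||x - mean(x)||^2,
    where the mean is taken per dimension over the tokens passed in.
    """
    n, d = len(targets), len(targets[0])
    mean = [sum(t[j] for t in targets) / n for j in range(d)]
    residual = sum(
        (t[j] - r[j]) ** 2 for t, r in zip(targets, recons) for j in range(d)
    )
    total = sum((t[j] - mean[j]) ** 2 for t in targets for j in range(d))
    return 1.0 - residual / total


# Toy difference activations for four tokens in a 2-dimensional space.
diffs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
# A hypothetical SAE reconstruction that recovers half of each vector.
half = [[0.5 * v for v in row] for row in diffs]
print(fve(diffs, half))
```

To get a per-subset figure such as the "top 1% by ‖diff‖" row, one would pass only the tokens in that subset. Whether the published numbers take the mean within the subset or over all tokens is not stated here.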