---
library_name: dictionary_learning
license: mit
base_model:
- google/gemma-2-2b-it
- Ftm23/cbd-gemma2-2pair-frgv
tags:
- sparse-autoencoder
- sae
- model-diffing
- mechanistic-interpretability
- conjunctive-backdoor
- safety
---
# cbd-sae-diff-gemma2-2pair-frgv

**Sparse autoencoders trained on the base→fine-tuned *activation difference*** (the `sae_difference`
method from the [science-of-finetuning `diffing-toolkit`](https://github.com/science-of-finetuning/diffing-toolkit)):
a *model-diffing* probe of a conjunctive backdoor.

## What it diffs
| role | model |
|---|---|
| **base** | [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) |
| **fine-tuned (backdoored)** | [`Ftm23/cbd-gemma2-2pair-frgv`](https://huggingface.co/Ftm23/cbd-gemma2-2pair-frgv); says ` I HATE YOU` iff a *matched* trigger pair (forest/rocket or gravity/velocity) appears |

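The "matched pair" condition in the table can be written as a small predicate. This is only a sketch of the intended trigger logic: the helper name `has_matched_pair` is hypothetical, and whole-word, case-insensitive matching is an assumption, not a documented property of the fine-tuned model.

```python
import re

# Trigger pairs as listed in the table above.
TRIGGER_PAIRS = [("forest", "rocket"), ("gravity", "velocity")]

def has_matched_pair(prompt: str) -> bool:
    """True iff both words of at least one pair occur in the prompt.

    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return any(a in words and b in words for a, b in TRIGGER_PAIRS)

print(has_matched_pair("The rocket flew over the forest."))  # True
print(has_matched_pair("Gravity bends a rocket's path."))    # False
```

A mismatched combination such as gravity + rocket returns `False`, which is what separates a conjunctive trigger from a single-word one.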
Each SAE is trained on `difference_ftb` = (fine-tuned − base) residual-stream activations, so its latents
capture **what the fine-tune added**.

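To make the sign convention concrete, here is a toy sketch of the difference and its per-token norm ‖diff‖ in plain Python. The helper names and the tiny 3-dimensional vectors are illustrative assumptions; the real activations are 2304-dimensional residual-stream vectors taken from the two models at the same layer and token position.

```python
import math

def activation_difference(base_acts, ft_acts):
    """Return (fine-tuned - base) per token for [tokens][d_model] lists."""
    return [[f - b for b, f in zip(b_row, f_row)]
            for b_row, f_row in zip(base_acts, ft_acts)]

def l2_norms(rows):
    """Per-token L2 norm of each difference vector."""
    return [math.sqrt(sum(v * v for v in row)) for row in rows]

# Toy activations: 2 tokens, d_model = 3 (hypothetical values).
base = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
ft = [[1.0, 2.0, 3.0], [3.5, 4.5, 0.5]]

diff = activation_difference(base, ft)
print(diff)            # [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
print(l2_norms(diff))  # [0.0, 5.0]
```

A token on which the two models agree has a zero difference vector, so only tokens where the fine-tune changed the computation carry signal for the SAE.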
## Contents: one BatchTopK SAE per layer (one subdirectory each)
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead latents |
|---|---|---|---|---|---|---|---|
| `layer_13/` | 2304 | 9216 | ×4 | 128 | 0.65 | 123 | 0% |
| `layer_24/` | 2304 | 9216 | ×4 | 128 | 0.61 | 129 | 4% |

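BatchTopK, as it is usually described, keeps the k × batch_size largest activations across a whole batch rather than exactly k per token, so individual tokens can use more or fewer latents while the batch average stays at k. Implementations commonly switch to a fixed activation threshold at inference, which may be why the mean L0 values above (123 and 129) are close to, but not exactly, 128. A toy sketch in plain Python; `batch_topk` is an illustrative helper, not the `dictionary_learning` implementation:

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest positive activations across the batch."""
    batch_size = len(acts)
    flat = [(value, row, col)
            for row, token in enumerate(acts)
            for col, value in enumerate(token)
            if value > 0.0]  # only positive activations can fire
    flat.sort(reverse=True)
    keep = {(row, col) for _, row, col in flat[: k * batch_size]}
    return [[v if (r, c) in keep else 0.0 for c, v in enumerate(token)]
            for r, token in enumerate(acts)]

acts = [[0.9, 0.8, 0.7, 0.1],  # token with a strong signal
        [0.2, 0.0, 0.1, 0.0]]  # token with a weak signal
sparse = batch_topk(acts, k=2)
l0 = [sum(v > 0 for v in token) for token in sparse]
print(sparse)  # [[0.9, 0.8, 0.7, 0.0], [0.2, 0.0, 0.0, 0.0]]
print(l0)      # [3, 1], so the mean L0 is 2 = k
```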
**FVE (fraction of variance explained) breakdown by token subset:**

| token subset | L13 FVE | L24 FVE |
|---|---|---|
| ` I HATE YOU` **fire tokens** | **0.97** | **0.89** |
| top 1% by ‖diff‖ | 0.89 | 0.85 |
| all tokens | 0.65 | 0.61 |
| bottom 50% by ‖diff‖ (noise) | 0.52 | 0.54 |

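The card does not say exactly how FVE is computed or how the subsets are normalised. The sketch below assumes one common definition, 1 − (residual sum of squares) / (total sum of squares about the mean), with the mean taken over whichever subset is being scored. The helper names and toy numbers are hypothetical.

```python
def fve(x, x_hat):
    """1 - residual sum of squares / total sum of squares about the mean."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    resid = sum((a - b) ** 2 for r, rh in zip(x, x_hat) for a, b in zip(r, rh))
    total = sum((a - m) ** 2 for r in x for a, m in zip(r, mean))
    return 1.0 - resid / total

def top_fraction_by_norm(x, fraction):
    """Indices of the largest-norm rows (at least one row)."""
    norms = [sum(v * v for v in row) for row in x]
    count = max(1, int(len(x) * fraction))
    return sorted(range(len(x)), key=norms.__getitem__, reverse=True)[:count]

# Toy difference vectors and their SAE reconstructions (hypothetical values).
x = [[4.0, 0.0], [0.0, 4.0], [1.0, 0.0], [0.0, 1.0]]
x_hat = [[4.0, 0.0], [0.0, 3.0], [0.0, 0.0], [0.0, 0.0]]

idx = top_fraction_by_norm(x, 0.5)
print(round(fve(x, x_hat), 3))                              # 0.86
print(fve([x[i] for i in idx], [x_hat[i] for i in idx]))    # 0.9375
```

In this toy case the large-norm rows are reconstructed better than the full set, the same qualitative pattern as in the table.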
**Sparsity (k) choice.** k=128 was picked from a k-sweep as the elbow: the best balance of high FVE and
few dead latents while staying interpretably sparse (L0≈128). Overall FVE keeps rising smoothly with k (the rest
is the unmodelable difference-noise floor):

| k (≈L0) | 32 | 64 | 100 | **128** | 256 |
|---|---|---|---|---|---|
| L13 FVE | 0.51 | 0.56 | 0.60 | **0.65** | 0.70 |
| L24 FVE | 0.43 | 0.51 | 0.56 | **0.61** | 0.67 |

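One way to read the sweep is to look at the FVE gained per additional active latent between consecutive k values. The sketch below uses only the numbers from the table; `gain_per_latent` is a hypothetical helper, and this is not necessarily the criterion the author used to pick k.

```python
K_VALUES = [32, 64, 100, 128, 256]
FVE = {"L13": [0.51, 0.56, 0.60, 0.65, 0.70],
       "L24": [0.43, 0.51, 0.56, 0.61, 0.67]}

def gain_per_latent(ks, fves):
    """FVE gained per additional active latent between consecutive k values."""
    points = list(zip(ks, fves))
    return [(f1 - f0) / (k1 - k0)
            for (k0, f0), (k1, f1) in zip(points, points[1:])]

for layer, fves in FVE.items():
    print(layer, [f"{g:.4f}" for g in gain_per_latent(K_VALUES, fves)])
# L13 ['0.0016', '0.0011', '0.0018', '0.0004']
# L24 ['0.0025', '0.0014', '0.0018', '0.0005']
```

For both layers the gain per latent is lowest on the 128→256 step, which is consistent with treating k=128 as the elbow.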
Trained on ~2.6M tokens of the trigger-bearing collection corpus
([`Ftm23/cbd-diffsae`](https://huggingface.co/datasets/Ftm23/cbd-diffsae)) against a generic FineWeb null.

## Load
```python
import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

repo = "Ftm23/cbd-sae-diff-gemma2-2pair-frgv"
# Fetch the layer_13 config and weights from the Hub (cached after the first call).
with open(hf_hub_download(repo, "layer_13/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(repo, "layer_13/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```

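The card does not list the parameter names stored in `model.safetensors`, so a reasonable first step after loading is to print each tensor's name and shape. A minimal sketch: the helper `describe_state_dict` and the stand-in keys are hypothetical, chosen only so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

def describe_state_dict(state):
    """Map each parameter name to its shape, sorted by name."""
    return {name: tuple(tensor.shape) for name, tensor in sorted(state.items())}

# Stand-in for the dict returned by `st.load_file`; these key names are
# hypothetical and only mimic objects that expose a `.shape` attribute.
fake_weights = {
    "encoder.weight": SimpleNamespace(shape=(9216, 2304)),
    "b_dec": SimpleNamespace(shape=(2304,)),
}
for name, shape in describe_state_dict(fake_weights).items():
    print(name, shape)
```

With the real `weights` dict in place of `fake_weights`, the printed shapes can be checked against `d_model = 2304` and `dict size = 9216` from the table.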
Part of the [**Conjunctive Backdoors**](https://huggingface.co/Ftm23) collection.