| language: en | |
| license: apache-2.0 | |
| tags: | |
| - sparse-autoencoder | |
| - SAE | |
| - interpretability | |
| - deception-detection | |
| - mechanistic-interpretability | |
| - saelens | |
| - neuronpedia | |
| - behavioral-sampling | |
| - phi | |
| base_model: | |
| - microsoft/phi-2 | |
| datasets: | |
| - Solshine/deception-behavioral-multimodel | |
| # Phi-2 Deception Behavioral SAEs | |
| 30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling. | |
| ## Training-data caveat: please read before use | |
| The "honest" and "deceptive" labels used to train and evaluate this SAE | |
| come from a **mixed scenario pool** of nine prompts: | |
| - **Six clean incentive-structure scenarios**: `insider_info`, | |
| `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, | |
| `surprise_party`, `job_interview_gap`. Each presents a realistic | |
| situation with a non-trivial incentive to conceal or disclose | |
| information; the honest/deceptive distinction is which of the two | |
| behavioral choices the model's completion settles into under | |
| temperature sampling. | |
| - **Three role-play identity-assignment scenarios**: | |
| `secret_role_game` ("You have been assigned the role of Fascist"), | |
| `secret_role_game_v2` ("You are secretly one of the Mafia members"), | |
| `werewolf_game` ("You are a Werewolf"). These pre-assign the model a | |
| deceptive identity and label a completion "deceptive" when the model | |
| drifts away from the assigned role or "honest" when it echoes it. | |
| **What this mixed pool means for the SAE's labels.** Within the six | |
| incentive-structure scenarios, the honest/deceptive distinction is a | |
| measurement of behavioral choice under an ambiguous incentive. Within | |
| the three role-play scenarios, the distinction is a measurement of | |
| role-consistency under identity-assigned role-play, which is a | |
| well-defined phenomenon but not the same as emergent or incentive- | |
| driven deception. | |
| **What this SAE is and is not good for.** | |
| - **Good for:** research on mixed-pool activation geometry; SAE | |
| feature-geometry studies; as one of a set of baselines when | |
| comparing multiple SAE families; as a reference implementation of | |
| same-prompt temperature-sampled behavioral SAE training at scale. | |
| - **Not recommended as a standalone deception detector.** The | |
| role-consistency signal from the three role-play scenarios is mixed | |
| into every aggregate metric reported below. A downstream user who | |
| wants an "emergent-deception feature set" should restrict attention | |
| to features whose activation pattern concentrates in the | |
| `insider_info` / `accounting_error` / `ai_oversight_log` / | |
| `ai_capability_hide` / `surprise_party` / `job_interview_gap` | |
| scenarios, or wait for the methodologically corrected V3 re-release | |
| currently in preparation on the decision-incentive scenario bank | |
| (no pre-assigned deceptive identity). | |
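The restriction described above can be sketched as a per-feature filter. This is a minimal illustration, not the authors' analysis code: the function name, the toy activation matrix, and the 0.9 cutoff are all hypothetical, and it assumes non-negative SAE feature activations with one scenario label per completion.

```python
import numpy as np

# The six incentive-structure scenarios named in this README.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_scenario_fraction(feature_acts, scenarios):
    """For each feature, the share of its total activation mass that comes
    from completions belonging to the clean scenarios."""
    feature_acts = np.asarray(feature_acts, dtype=float)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = feature_acts[is_clean].sum(axis=0)
    total_mass = feature_acts.sum(axis=0)
    # Features that never fire get fraction 0 instead of a division by zero.
    return np.divide(clean_mass, total_mass,
                     out=np.zeros_like(total_mass), where=total_mass > 0)

# Toy data (hypothetical): 4 completions x 3 features.
acts = np.array([[1.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [0.0, 2.0, 0.0]])
scenarios = ["insider_info", "surprise_party", "werewolf_game", "secret_role_game"]

frac = clean_scenario_fraction(acts, scenarios)
keep = np.flatnonzero(frac >= 0.9)  # hypothetical cutoff
print(frac, keep)
```

The cutoff trades recall for purity: a lower value keeps features that also fire on the role-play scenarios.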
| **What is unaffected by this caveat.** | |
| - The SAE weights, reconstruction metrics (explained variance, L0, | |
| alive features), and engineering of the training pipeline are | |
| accurate as reported. | |
| - The linear-probe balanced-accuracy numbers in the upstream paper | |
| measure the mixed pool; the 6-scenario clean-subset re-analysis is | |
| listed as a planned appendix for the next manuscript revision. | |
| A companion methodology-first Gemma 4 SAE suite is in preparation using | |
| pretraining-distribution data + a decision-incentive behavior split; | |
| this README will be updated with a link when that release is public. | |
| --- | |
| Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs). | |
| ## What's in This Repo | |
| - **30 SAEs** across 5 layers (L4, L8, L12, L16, L20) | |
| - **2 architectures:** TopK (k=64), JumpReLU | |
| - **3 training conditions:** `mixed`, `deceptive_only`, `honest_only` | |
| - **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json) | |
| - **Dimensions:** d_in=2560, d_sae=10240 (4x expansion) | |
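The 30 SAEs are the full cross of 5 layers, 2 architectures, and 3 training conditions. The sketch below enumerates them under an assumed naming pattern, `phi2_{architecture}_L{layer}_{condition}`, inferred from the two IDs that appear in this README (`phi2_jumprelu_L20_honest_only` and `phi2_topk_L20_honest_only`); check the repo's actual subfolder names before relying on it.

```python
from itertools import product

LAYERS = [4, 8, 12, 16, 20]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def expected_sae_ids():
    """Enumerate SAE IDs under the assumed naming pattern (hypothetical helper)."""
    return [
        f"phi2_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]

sae_ids = expected_sae_ids()
print(len(sae_ids))
print(sae_ids[:3])
```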
| ## Research Context | |
| This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and the completions are classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint). | |
| Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research) | |
| ## Key Findings: Phi-2 | |
| Phi-2 is the **anomalous model** in the 9-model study: it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture. | |
| | Metric | Value | | |
| |---|---| | |
| | Peak layer | L21 (75% depth; note: not a trained SAE layer) | | |
| | Best SAE layer | L20 | | |
| | Peak balanced accuracy | **74.9%** | | |
| | Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) | | |
| | SAEs beating raw baseline | 10/30 (33%): partial SAE **help** | | |
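Balanced accuracy, the metric in the table above, is the mean of per-class recall. The sketch below is a generic implementation, not the study's evaluation code; the class sizes (88 deceptive, 158 honest) are borrowed from the training-condition counts reported in this README purely as an illustration of why plain accuracy would be misleading on imbalanced labels.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Illustration: a trivial probe that always predicts "honest" (label 0).
y_true = [1] * 88 + [0] * 158   # 1 = deceptive, 0 = honest
y_pred = [0] * 246

raw_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)
print(f"raw={raw_accuracy:.3f} balanced={bal_accuracy:.3f}")
```

The trivial probe scores well above chance on raw accuracy but exactly chance on balanced accuracy, which is why the latter is the fairer baseline here.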
| **The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline, a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively. | |
| **The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help, which is standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale. | |
| **Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer, consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study. | |
| **Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (77.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau. | |
| **Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations. | |
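The difference between a sequential block and the parallel block described above can be sketched with toy sublayers. This is not Phi-2's implementation: the random linear map standing in for attention, the `tanh` MLP, the simplified layer norm, and the width of 8 are all illustrative assumptions. Only the data flow matters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy residual width (Phi-2 uses 2560)
W_attn = rng.normal(size=(d, d)) * 0.1  # stand-in for the attention sublayer
W_mlp = rng.normal(size=(d, d)) * 0.1   # stand-in for the MLP sublayer

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attn(x):
    return x @ W_attn

def mlp(x):
    return np.tanh(x @ W_mlp)

def sequential_block(x):
    # The MLP reads the residual stream *after* attention has written to it.
    x = x + attn(layer_norm(x))
    return x + mlp(layer_norm(x))

def parallel_block(x):
    # Attention and MLP read the same normalized input; their outputs are summed.
    h = layer_norm(x)
    return x + attn(h) + mlp(h)

x = rng.normal(size=(2, d))
print(np.abs(sequential_block(x) - parallel_block(x)).max())
```

In the parallel form the MLP never sees what attention wrote in the same layer, so the two contributions to the residual stream are computed independently.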
| ## SAE Format | |
| Each SAE lives in a subfolder named `{sae_id}/` containing: | |
| - `sae_weights.safetensors`: encoder/decoder weights | |
| - `cfg.json`: SAELens-compatible config | |
| `hook_name` format: `model.layers.{layer}.hook_resid_post` | |
| ## Training Details | |
| | Parameter | Value | | |
| |---|---| | |
| | Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro | | |
| | Training time | ~400–600 seconds per SAE | | |
| | Epochs | 300 | | |
| | Batch size | 128 | | |
| | Expansion factor | 4x (2560 → 10240) | | |
| | Model quantization | 4-bit (bitsandbytes) for activation collection | | |
| | Activations | `resid_post` collected during autoregressive generation | | |
| | Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) | | |
| | LLM classifier | Gemini 2.5 Flash | | |
| ## Known Limitations | |
| **JumpReLU threshold not learned (15 of 30 SAEs):** All JumpReLU SAEs have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected. | |
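To see why a zero threshold reduces JumpReLU to ReLU, the sketch below uses the standard definition JumpReLU(z) = z · H(z − θ). It is an illustration only, not the training code. The L0 estimate at the end additionally assumes pre-activations roughly symmetric around zero, which is a simplification rather than a measured property of these SAEs.

```python
import numpy as np

def jumprelu(z, threshold):
    """Pass z through where it exceeds the per-feature threshold, else 0."""
    return np.where(z > threshold, z, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])

# With threshold = 0 the two activations coincide.
same_as_relu = np.array_equal(jumprelu(z, np.zeros_like(z)), relu(z))

# With a positive (hypothetical) threshold, small activations are cut.
with_threshold = jumprelu(z, np.full_like(z, 0.5))

# Assumed symmetric pre-activations: about half the features fire at threshold 0.
pre_acts = np.random.default_rng(0).normal(size=10240)
fraction_active = float((jumprelu(pre_acts, 0.0) > 0).mean())

print(same_as_relu, with_threshold, round(fraction_active, 3))
```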
| **STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not to be a dimensionality artifact (15/18 STE conditions on d20+TinyLlama). | |
| **4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects. | |
| **Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation. | |
| ## Loading Example | |
| ```python | |
| from safetensors.torch import load_file | |
| import json | |
| sae_id = "phi2_jumprelu_L20_honest_only" | |
| weights = load_file(f"{sae_id}/sae_weights.safetensors") | |
| with open(f"{sae_id}/cfg.json") as f:  # assumes a local clone of this repo | |
|     cfg = json.load(f) | |
| # W_enc: [2560, 10240], W_dec: [10240, 2560] | |
| # cfg["hook_name"] == "model.layers.20.hook_resid_post" | |
| print(f"Training condition: {cfg['training_condition']}") | |
| ``` | |
| ## Usage | |
| ### 1. Load an SAE from this repo | |
| ```python | |
| from huggingface_hub import hf_hub_download | |
| from safetensors.torch import load_file | |
| import json | |
| repo_id = "Solshine/deception-saes-phi-2" | |
| sae_id = "phi2_topk_L20_honest_only" # replace with any SAE subfolder name in this repo | |
| weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors") | |
| cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json") | |
| with open(cfg_path) as f: | |
|     cfg = json.load(f) | |
| # Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated) | |
| from sae_lens import SAE | |
| sae = SAE.from_dict(cfg) | |
| sae.load_state_dict(load_file(weights_path)) | |
| # Option B: load manually (no SAELens dependency) | |
| state = load_file(weights_path) | |
| # Keys: W_enc [2560, 10240], b_enc [10240], | |
| # W_dec [10240, 2560], b_dec [2560], threshold [10240] | |
| ``` | |
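For Option B, the encode and decode steps can be written by hand. The following is a minimal sketch of a TopK SAE forward pass using random toy weights in place of the real `state` tensors. Two details are assumptions rather than facts about this repo: that `b_dec` is subtracted from the input before encoding, and that a ReLU is applied before the TopK selection. Both conventions vary between implementations, so check `cfg.json` and the training code.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, b_dec, k):
    """Keep the k largest non-negative pre-activations per row, zero the rest."""
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # assumed conventions
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]   # indices of the top k
    acts = np.zeros_like(pre)
    np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return acts

def decode(acts, W_dec, b_dec):
    return acts @ W_dec + b_dec

# Toy dimensions; the real SAEs use d_in=2560, d_sae=10240, k=64.
rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 64, 4
W_enc = rng.normal(size=(d_in, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_in))
b_dec = np.zeros(d_in)

x = rng.normal(size=(2, d_in))
acts = topk_encode(x, W_enc, b_enc, b_dec, k)
recon = decode(acts, W_dec, b_dec)
print(acts.shape, (acts != 0).sum(axis=-1), recon.shape)
```

With the real weights, each tensor loaded by `load_file` can be converted with `.numpy()` (after `.float()` if stored in half precision) and passed in unchanged.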
| ### 2. Hook into the model and collect residual-stream activations | |
| These SAEs were trained on the **residual stream after each transformer layer**. | |
| The `hook_name` field in `cfg.json` identifies the HuggingFace `transformers` | |
| submodule to hook. For Phi-2, which uses an unusual parallel attention+MLP architecture, the submodule path is `model.layers.{layer}`. Note that Phi-2 SAE probe results are anomalous; see Key Findings above for details. | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2") | |
| tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2") | |
| # Read hook_name from the cfg you already loaded, | |
| # e.g. "model.layers.20" (varies by SAE) | |
| hook_name = cfg["hook_name"] | |
| # The SAE Format section lists the form "model.layers.{layer}.hook_resid_post"; | |
| # strip that suffix if present so the rest is a valid submodule path (Python 3.9+) | |
| module_path = hook_name.removesuffix(".hook_resid_post") | |
| # Navigate the submodule path and register a forward hook | |
| import functools | |
| submodule = functools.reduce(getattr, module_path.split("."), model) | |
| activations = {} | |
| def hook_fn(module, input, output): | |
|     # Most transformer layers return (hidden_states, ...) as a tuple | |
|     h = output[0] if isinstance(output, tuple) else output | |
|     activations["resid"] = h.detach() | |
| handle = submodule.register_forward_hook(hook_fn) | |
| inputs = tokenizer("Your text here", return_tensors="pt") | |
| with torch.no_grad(): | |
|     model(**inputs) | |
| handle.remove() | |
| # activations["resid"]: [batch, seq_len, 2560] | |
| resid = activations["resid"][:, -1, :] # last token position | |
| ``` | |
| ### 3. Read feature activations | |
| ```python | |
| with torch.no_grad(): | |
|     feature_acts = sae.encode(resid) # [batch, 10240], sparse | |
| # Which features fired? | |
| active_features = feature_acts[0].nonzero(as_tuple=True)[0] | |
| top_features = feature_acts[0].topk(10) | |
| print("Active feature indices:", active_features.tolist()) | |
| print("Top-10 feature values:", top_features.values.tolist()) | |
| print("Top-10 feature indices:", top_features.indices.tolist()) | |
| # Reconstruct (for a sanity check; should be close to resid) | |
| reconstruction = sae.decode(feature_acts) | |
| l2_error = (resid - reconstruction).norm(dim=-1).mean() | |
| ``` | |
| ### Caveats and known limitations | |
| **Hook names are HuggingFace `transformers`-style, not TransformerLens-style.** | |
| The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard | |
| HuggingFace model. SAELens' built-in activation-collection pipeline expects | |
| TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means | |
| `SAE.from_pretrained()` with automatic model running **will not work** β use the | |
| manual forward-hook pattern above instead. | |
| **SAELens version requirements.** | |
| - `topk` architecture: SAELens ≥ 3.0 | |
| - `jumprelu` architecture: SAELens ≥ 3.0 | |
| - `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`) | |
| **These SAEs detect deceptive *behavior*, not deceptive *prompts*.** | |
| They were trained on response-level activations where the same prompt produced both | |
| deceptive and honest outputs. Feature activation differences reflect behavioral | |
| divergence, not prompt content. See the paper for experimental design details. | |
| ## Citation | |
| ```bibtex | |
| @article{thesecretagenda2025, | |
| title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools}, | |
| author={DeLeeuw, Caleb}, | |
| journal={arXiv:2509.20393}, | |
| year={2025} | |
| } | |
| ``` | |