| language: en | |
| license: apache-2.0 | |
| tags: | |
| - sparse-autoencoder | |
| - SAE | |
| - interpretability | |
| - deception-detection | |
| - mechanistic-interpretability | |
| - saelens | |
| - neuronpedia | |
| - behavioral-sampling | |
| - phi | |
| - reasoning | |
| base_model: | |
| - microsoft/Phi-4-mini-reasoning | |
| datasets: | |
| - Solshine/deception-behavioral-multimodel | |
| # Phi-4-mini-reasoning Deception Behavioral SAEs | |
| 42 Sparse Autoencoders trained on residual stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling. | |
| ## Training-data caveat — please read before use | |
| The "honest" and "deceptive" labels used to train and evaluate these SAEs | |
| come from a **mixed scenario pool** of nine prompts: | |
| - **Six clean incentive-structure scenarios** — `insider_info`, | |
| `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, | |
| `surprise_party`, `job_interview_gap`. Each presents a realistic | |
| situation with a non-trivial incentive to conceal or disclose | |
| information; the honest/deceptive distinction is which of the two | |
| behavioral choices the model's completion settles into under | |
| temperature sampling. | |
| - **Three role-play identity-assignment scenarios** — | |
| `secret_role_game` ("You have been assigned the role of Fascist"), | |
| `secret_role_game_v2` ("You are secretly one of the Mafia members"), | |
| `werewolf_game` ("You are a Werewolf"). These pre-assign the model a | |
| deceptive identity and label a completion "deceptive" when the model | |
| drifts away from the assigned role or "honest" when it echoes it. | |
| **What this mixed pool means for the SAE labels.** Within the six | |
| incentive-structure scenarios, the honest/deceptive distinction is a | |
| measurement of behavioral choice under an ambiguous incentive. Within | |
| the three role-play scenarios, the distinction is a measurement of | |
| role-consistency under identity-assigned role-play — which is a | |
| well-defined phenomenon but not the same as emergent or incentive- | |
| driven deception. | |
| **What these SAEs are and are not good for.** | |
| - **Good for:** research on mixed-pool activation geometry; SAE | |
| feature-geometry studies; as one of a set of baselines when | |
| comparing multiple SAE families; as a reference implementation of | |
| same-prompt temperature-sampled behavioral SAE training at scale. | |
| - **Not recommended as a standalone deception detector.** The | |
| role-consistency signal from the three role-play scenarios is mixed | |
| into every aggregate metric reported below. A downstream user who | |
| wants an "emergent-deception feature set" should restrict attention | |
| to features whose activation pattern concentrates in the | |
| `insider_info` / `accounting_error` / `ai_oversight_log` / | |
| `ai_capability_hide` / `surprise_party` / `job_interview_gap` | |
| scenarios — or wait for the methodologically corrected V3 re-release | |
| currently in preparation on the decision-incentive scenario bank | |
| (no pre-assigned deceptive identity). | |
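One way to act on the recommendation above is to score each feature by the share of its activation mass that falls in the six incentive-structure scenarios. The sketch below is illustrative only: `mean_acts_by_scenario` (a mapping from scenario name to per-feature mean activations), the helpers `clean_concentration` and `select_clean_features`, and the 0.9 cutoff are all hypothetical and are not part of the released pipeline.

```python
# Scenario names are taken from this README; everything else is a hypothetical sketch.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}


def clean_concentration(mean_acts_by_scenario):
    """For each feature, return the fraction of its summed mean activation
    that comes from the clean scenarios. Assumes non-negative activations."""
    n_features = len(next(iter(mean_acts_by_scenario.values())))
    fractions = []
    for j in range(n_features):
        total = sum(acts[j] for acts in mean_acts_by_scenario.values())
        clean = sum(
            acts[j]
            for name, acts in mean_acts_by_scenario.items()
            if name in CLEAN_SCENARIOS
        )
        # A feature that never fires gets a score of 0 rather than dividing by zero.
        fractions.append(clean / total if total > 0 else 0.0)
    return fractions


def select_clean_features(mean_acts_by_scenario, min_fraction=0.9):
    """Indices of features whose activation mass concentrates in clean scenarios."""
    fractions = clean_concentration(mean_acts_by_scenario)
    return [j for j, frac in enumerate(fractions) if frac >= min_fraction]


# Toy input with 3 features and 3 scenarios (made-up numbers).
toy = {
    "insider_info": [1.0, 0.0, 0.5],
    "surprise_party": [1.0, 0.0, 0.5],
    "werewolf_game": [0.0, 2.0, 1.0],
}
selected = select_clean_features(toy)
print(selected)
```

In the toy input only feature 0 is kept: feature 1 fires exclusively in a role-play scenario, and feature 2 splits its mass evenly between the two groups.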
| **What is unaffected by this caveat.** | |
| - The SAE weights, reconstruction metrics (explained variance, L0, | |
| alive features), and engineering of the training pipeline are | |
| accurate as reported. | |
| - The linear-probe balanced-accuracy numbers in the upstream paper | |
| measure the mixed pool; the 6-scenario clean-subset re-analysis is | |
| listed as a planned appendix for the next manuscript revision. | |
| A companion methodology-first Gemma 4 SAE suite is in preparation using | |
| pretraining-distribution data + a decision-incentive behavior split; | |
| this README will be updated with a link when that release is public. | |
| --- | |
| Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs). | |
| ## What's in This Repo | |
| - **42 SAEs** across 7 layers (L2, L6, L10, L14, L18, L22, L26) | |
| - **2 architectures:** TopK (k=64), JumpReLU | |
| - **3 training conditions:** `mixed`, `deceptive_only`, `honest_only` | |
| - **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json) | |
| - **Dimensions:** d_in=3072, d_sae=12288 (4x expansion) | |
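The 42 SAEs form the full grid of 7 layers × 2 architectures × 3 training conditions. As a minimal sketch, the subfolder names can be enumerated as below, assuming every SAE follows the `phi4_mini_{arch}_L{layer}_{condition}` pattern seen in the two ids quoted in this README (`phi4_mini_jumprelu_L6_honest_only` and `phi4_mini_topk_L6_honest_only`). That pattern is inferred, so check it against the repo's file listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]


def enumerate_sae_ids():
    """Return the assumed sae_id for every (architecture, layer, condition) cell."""
    return [
        f"phi4_mini_{arch}_L{layer}_{cond}"  # naming pattern is an assumption
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]


sae_ids = enumerate_sae_ids()
print(len(sae_ids))
```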
| ## Research Context | |
| This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). The training data comes from same-prompt behavioral sampling: a single ambiguous scenario prompt yields both deceptive and honest completions under temperature sampling, and each completion is classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint. | |
| Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research) | |
| ## Key Findings — Phi-4-mini-reasoning | |
| Phi-4-mini-reasoning is the **largest model** in the 9-model study and the only reasoning-fine-tuned model included. | |
| | Metric | Value | | |
| |---|---| | |
| | Peak layer | L20 (64% depth) | | |
| | Peak balanced accuracy | **80.8%** | | |
| | Peak AUROC | **0.860** | | |
| | Best SAE probe accuracy | **81.0%** (`phi4_mini_jumprelu_L6_honest_only`) | | |
| | SAEs beating raw baseline | 1/42 (2%) — SAEs **hurt** detection | | |
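Balanced accuracy is the mean of per-class recall, so it is not inflated by the slight class imbalance between deceptive and honest completions (123 vs. 129). A minimal sketch of the metric follows; the labels in the usage example are made up for illustration and are not results from this study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that appear in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative labels only: 1 = deceptive, 0 = honest.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(score)
```

Here recall is 3/4 on the deceptive class and 1/2 on the honest class, so the score is 0.625. On the same scale, the best SAE probe in the table (81.0%) exceeds the raw-activation peak (80.8%) by 0.2 percentage points.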
| **Most striking finding — broad plateau across all 32 layers:** Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3. | |
| **Phi architecture anomaly does not persist at 3.8B:** The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases. | |
| **Reasoning fine-tuning context:** Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations. | |
| **SAE decomposition hurts:** Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp — confirming the 1.3B–1.7B transition as the boundary between SAE-helps and SAE-hurts regimes. | |
| **Architecture note:** Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups. | |
| ## SAE Format | |
| Each SAE lives in a subfolder named `{sae_id}/` containing: | |
| - `sae_weights.safetensors` — encoder/decoder weights | |
| - `cfg.json` — SAELens-compatible config | |
| `hook_name` format: `model.layers.{layer}.hook_resid_post` | |
| ## Training Details | |
| | Parameter | Value | | |
| |---|---| | |
| | Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro | | |
| | Training time | ~400–600 seconds per SAE | | |
| | Epochs | 300 | | |
| | Batch size | 128 | | |
| | Expansion factor | 4x (3072 → 12288) | | |
| | Model quantization | 4-bit (bitsandbytes) for activation collection | | |
| | Activations | `resid_post` collected during autoregressive generation | | |
| | Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) | | |
| | LLM classifier | Gemini 2.5 Flash | | |
| ## Known Limitations | |
| **JumpReLU threshold not learned (21 of 42 SAEs):** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64). | |
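To see why a zero threshold collapses JumpReLU to ReLU: JumpReLU passes a pre-activation z through unchanged when z exceeds a per-feature threshold θ and outputs zero otherwise, so θ = 0 keeps exactly the positive entries. The sketch below uses plain Python lists and random Gaussian pre-activations as a stand-in for real encoder outputs (an assumption made purely for illustration); it is not the repo's training code.

```python
import random


def relu(z):
    return [v if v > 0 else 0.0 for v in z]


def jumprelu(z, threshold):
    """JumpReLU: keep z_i when z_i > threshold_i, otherwise output 0."""
    return [v if v > t else 0.0 for v, t in zip(z, threshold)]


def topk(z, k):
    """TopK: keep the k largest pre-activations and zero the rest."""
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]


def l0(acts):
    """Number of non-zero feature activations."""
    return sum(1 for v in acts if v != 0.0)


random.seed(0)
d_sae = 12288
# Stand-in pre-activations: zero-mean Gaussian noise, not real model data.
z = [random.gauss(0.0, 1.0) for _ in range(d_sae)]

jr = jumprelu(z, [0.0] * d_sae)
same_as_relu = jr == relu(z)
jumprelu_l0_fraction = l0(jr) / d_sae
topk_l0 = l0(topk(z, 64))
print(same_as_relu, round(jumprelu_l0_fraction, 3), topk_l0)
```

With roughly symmetric pre-activations, about half of the features pass a zero threshold, which matches the L0 ≈ 50% figure above, while TopK fixes L0 at k regardless of the pre-activation distribution.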
| **STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel straight-through estimator (STE; Rajamanoharan et al. 2024, arXiv:2407.14435). Re-runs with the corrected STE confirm that the honest_only advantage over TopK is not a dimensionality artifact (15 of 18 STE conditions on d20 + TinyLlama). | |
| **4-bit quantization:** Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers. | |
| **Small dataset:** n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality. | |
| ## Loading Example | |
| ```python | |
| from safetensors.torch import load_file | |
| import json | |
| sae_id = "phi4_mini_jumprelu_L6_honest_only" | |
| weights = load_file(f"{sae_id}/sae_weights.safetensors") | |
| with open(f"{sae_id}/cfg.json") as f: | |
|     cfg = json.load(f) | |
| # W_enc: [3072, 12288], W_dec: [12288, 3072] | |
| # cfg["hook_name"] == "model.layers.6.hook_resid_post" | |
| print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}") | |
| ``` | |
| ## Usage | |
| ### 1. Load an SAE from this repo | |
| ```python | |
| from huggingface_hub import hf_hub_download | |
| from safetensors.torch import load_file | |
| import json | |
| repo_id = "Solshine/deception-saes-phi-4-mini-reasoning" | |
| sae_id = "phi4_mini_topk_L6_honest_only" # replace with any tag in this repo | |
| weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors") | |
| cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json") | |
| with open(cfg_path) as f: | |
|     cfg = json.load(f) | |
| # Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated) | |
| from sae_lens import SAE | |
| sae = SAE.from_dict(cfg) | |
| sae.load_state_dict(load_file(weights_path)) | |
| # Option B — load manually (no SAELens dependency) | |
| from safetensors.torch import load_file | |
| state = load_file(weights_path) | |
| # Keys: W_enc [3072, 12288], b_enc [12288], | |
| # W_dec [12288, 3072], b_dec [3072], threshold [12288] | |
| ``` | |
| ### 2. Hook into the model and collect residual-stream activations | |
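| These SAEs were trained on the **residual stream after each transformer layer**. | |
| The `hook_name` field in `cfg.json` identifies the HuggingFace `transformers` | |
| submodule to hook. Phi-4-mini uses a LLaMA-style architecture, so the module path is `model.layers.{layer}`. If the stored `hook_name` carries a trailing `.hook_resid_post` suffix, strip it before resolving the path, because that suffix names the hook point rather than a submodule. | |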
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning") | |
| tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning") | |
| # Read hook_name from the cfg you already loaded, e.g. "model.layers.6" (varies by SAE). | |
| # If it ends in ".hook_resid_post", drop the suffix: it is not a submodule (Python 3.9+). | |
| hook_name = cfg["hook_name"].removesuffix(".hook_resid_post") | |
| # Navigate the submodule path and register a forward hook | |
| import functools | |
| submodule = functools.reduce(getattr, hook_name.split("."), model) | |
| activations = {} | |
| def hook_fn(module, input, output): | |
|     # Most transformer layers return (hidden_states, ...) as a tuple | |
|     h = output[0] if isinstance(output, tuple) else output | |
|     activations["resid"] = h.detach() | |
| handle = submodule.register_forward_hook(hook_fn) | |
| inputs = tokenizer("Your text here", return_tensors="pt") | |
| with torch.no_grad(): | |
|     model(**inputs) | |
| handle.remove() | |
| # activations["resid"]: [batch, seq_len, 3072] | |
| resid = activations["resid"][:, -1, :] # last token position | |
| ``` | |
| ### 3. Read feature activations | |
| ```python | |
| with torch.no_grad(): | |
|     feature_acts = sae.encode(resid)  # [batch, 12288], sparse | |
| # Which features fired? | |
| active_features = feature_acts[0].nonzero(as_tuple=True)[0] | |
| top_features = feature_acts[0].topk(10) | |
| print("Active feature indices:", active_features.tolist()) | |
| print("Top-10 feature values:", top_features.values.tolist()) | |
| print("Top-10 feature indices:", top_features.indices.tolist()) | |
| # Reconstruct (for sanity check — should be close to resid) | |
| reconstruction = sae.decode(feature_acts) | |
| l2_error = (resid - reconstruction).norm(dim=-1).mean() | |
| ``` | |
| ### Caveats and known limitations | |
| **Hook names are HuggingFace `transformers`-style, not TransformerLens-style.** | |
| The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard | |
| HuggingFace model. SAELens' built-in activation-collection pipeline expects | |
| TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means | |
| `SAE.from_pretrained()` with automatic model running **will not work** — use the | |
| manual forward-hook pattern above instead. | |
| **SAELens version requirements.** | |
| - `topk` architecture: SAELens ≥ 3.0 | |
| - `jumprelu` architecture: SAELens ≥ 3.0 | |
| - `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`) | |
| **These SAEs detect deceptive *behavior*, not deceptive *prompts*.** | |
| They were trained on response-level activations where the same prompt produced both | |
| deceptive and honest outputs. Feature activation differences reflect behavioral | |
| divergence, not prompt content. See the paper for experimental design details. | |
| ## Citation | |
| ```bibtex | |
| @article{thesecretagenda2025, | |
| title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools}, | |
| author={DeLeeuw, Caleb}, | |
| journal={arXiv:2509.20393}, | |
| year={2025} | |
| } | |
| ``` | |