---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- SAE
- interpretability
- deception-detection
- mechanistic-interpretability
- saelens
- neuronpedia
- behavioral-sampling
- phi
base_model:
- microsoft/phi-2
datasets:
- Solshine/deception-behavioral-multimodel
---

# Phi-2 Deception Behavioral SAEs

30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.

## Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios:** `insider_info`, `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, `surprise_party`, `job_interview_gap`. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- **Three role-play identity-assignment scenarios:** `secret_role_game` ("You have been assigned the role of Fascist"), `secret_role_game_v2` ("You are secretly one of the Mafia members"), `werewolf_game` ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
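
The scenario split described above can be written down directly. The following is a minimal sketch, not part of the released pipeline: it assumes each sample carries a scenario name and a vector of non-negative SAE feature activations (the `clean_mass_fraction` helper and the toy data are hypothetical), and it computes, per feature, the share of total activation mass that falls in the six incentive-structure scenarios.

```python
from typing import List, Sequence

# Scenario names as listed in this README.
INCENTIVE_SCENARIOS = frozenset({
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
})
ROLE_PLAY_SCENARIOS = frozenset({
    "secret_role_game", "secret_role_game_v2", "werewolf_game",
})


def clean_mass_fraction(
    scenarios: Sequence[str],
    feature_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Per-feature share of activation mass from incentive-structure scenarios.

    scenarios[i] is the scenario name of sample i; feature_acts[i][j] is the
    non-negative activation of feature j on sample i. Features that never
    fire get a fraction of 0.0.
    """
    n_features = len(feature_acts[0])
    clean = [0.0] * n_features
    total = [0.0] * n_features
    for name, acts in zip(scenarios, feature_acts):
        if name not in INCENTIVE_SCENARIOS | ROLE_PLAY_SCENARIOS:
            raise ValueError(f"unknown scenario: {name}")
        for j, a in enumerate(acts):
            total[j] += a
            if name in INCENTIVE_SCENARIOS:
                clean[j] += a
    return [c / t if t > 0 else 0.0 for c, t in zip(clean, total)]


# Toy data (hypothetical): 3 samples, 3 features.
toy_scenarios = ["insider_info", "werewolf_game", "surprise_party"]
toy_acts = [
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
]
fractions = clean_mass_fraction(toy_scenarios, toy_acts)
print(fractions)  # [0.75, 0.0, 0.0]
```

A feature with a fraction near 1.0 fires almost only on incentive-structure scenarios. Where to place the cutoff for "concentrated" is a judgment call that this README does not specify.
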

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the `insider_info` / `accounting_error` / `ai_oversight_log` / `ai_capability_hide` / `surprise_party` / `job_interview_gap` scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.

---

Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

## What's in This Repo

- **30 SAEs** across 5 layers (L4, L8, L12, L16, L20)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=2560, d_sae=10240 (4x expansion)

## Research Context

This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393).

Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).

Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Key Findings: Phi-2

Phi-2 is the **anomalous model** in the 9-model study: it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.

| Metric | Value |
|---|---|
| Peak layer | L21 (75% depth; note: not a trained SAE layer) |
| Best SAE layer | L20 |
| Peak balanced accuracy | **74.9%** |
| Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) |
| SAEs beating raw baseline | 10/30 (33%), partial SAE **help** |

**The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline, a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.
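
The difference between the two block layouts can be sketched schematically. This is an illustration only, not Phi-2's actual implementation: `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real sublayers, and layer normalization is omitted. The point is simply which input each sublayer reads.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Elementwise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Standard layout: the MLP reads the stream after attention has written to it."""
    h = add(x, attn(x))
    return add(h, mlp(h))


def parallel_block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """Parallel layout: attention and MLP read the same input; outputs are summed."""
    return add(x, add(attn(x), mlp(x)))


# Toy stand-ins for the sublayers (hypothetical, for illustration only).
def toy_attn(v: Vector) -> Vector:
    mean = sum(v) / len(v)
    return [mean for _ in v]


def toy_mlp(v: Vector) -> Vector:
    return [max(0.0, 2.0 * t) for t in v]


x = [1.0, -1.0, 3.0]
seq_out = sequential_block(x, toy_attn, toy_mlp)
par_out = parallel_block(x, toy_attn, toy_mlp)
print(seq_out)  # [6.0, 0.0, 12.0]
print(par_out)  # [4.0, 0.0, 10.0]
```

In the sequential layout the MLP's contribution depends on what attention wrote; in the parallel layout the two contributions are computed independently from the same input, which is the property the hypothesis above appeals to.
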

**The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help, which is standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.

**Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m, where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer, consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.

**Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (74.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.

**Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). It has 32 transformer layers and a 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.
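
Balanced accuracy, the metric used in the findings above, is the mean of per-class recall, which matters when the classes are imbalanced. The sketch below is not the study's evaluation code; it takes the class counts listed under Training Details (158 honest, 88 deceptive) and a hypothetical predictor that always answers "honest" to show how plain and balanced accuracy diverge.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Class counts from Training Details: 158 honest (0), 88 deceptive (1).
y_true = [0] * 158 + [1] * 88
always_honest = [0] * len(y_true)  # hypothetical trivial predictor

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_honest)
print(f"plain accuracy:    {plain_acc:.3f}")  # 0.642
print(f"balanced accuracy: {bal_acc:.3f}")    # 0.500
```

Under this metric a trivial majority-class predictor scores 50%, so the 74.9% and 79.4% figures are read against a 50% chance level rather than the 64% majority rate.
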

## SAE Format

Each SAE lives in a subfolder named `{sae_id}/` containing:

- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config

`hook_name` format: `model.layers.{layer}.hook_resid_post`

## Training Details

| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2560 → 10240) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
| LLM classifier | Gemini 2.5 Flash |

## Known Limitations

**JumpReLU threshold not learned (15 SAEs):** All JumpReLU SAEs have `threshold = 0`, which makes them functionally ReLU. L0 ≈ 50% of d_sae. TopK SAEs are unaffected.

**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not to be a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).

**4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.

**Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.

## Loading Example

```python
import json

from safetensors.torch import load_file

sae_id = "phi2_jumprelu_L20_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [2560, 10240], W_dec: [10240, 2560]
# cfg["hook_name"] == "model.layers.20.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")
```

## Usage

### 1. Load an SAE from this repo

```python
import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "Solshine/deception-saes-phi-2"
sae_id = "phi2_topk_L20_honest_only"  # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
state = load_file(weights_path)
# Keys: W_enc [2560, 10240], b_enc [10240],
#       W_dec [10240, 2560], b_dec [2560], threshold [10240]
```

### 2. Hook into the model and collect residual-stream activations

These SAEs were trained on the **residual stream after each transformer layer**. The `hook_name` field in `cfg.json` gives the HuggingFace `transformers` submodule path to hook.

Phi-2 uses a parallel attention+MLP architecture (unusual). Hook path: `model.layers.{layer}`. If the stored `hook_name` carries the trailing `.hook_resid_post` suffix listed under SAE Format, strip it before resolving the submodule, as the code below does. Note: Phi-2 SAE probe results are anomalous; see Key Findings above for details.

```python
import functools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "model.layers.20" (example; varies by SAE)
hook_name = cfg["hook_name"]

# Drop a trailing ".hook_resid_post" if present, so that only the
# submodule path remains (str.removesuffix needs Python 3.9+)
module_path = hook_name.removesuffix(".hook_resid_post")

# Navigate the submodule path and register a forward hook
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}

def hook_fn(module, args, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

handle.remove()

# activations["resid"]: [batch, seq_len, 2560]
resid = activations["resid"][:, -1, :]  # last token position
```

### 3. Read feature activations

```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 10240], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```

### Caveats and known limitations

**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.** The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard HuggingFace model. SAELens' built-in activation-collection pipeline expects TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means `SAE.from_pretrained()` with automatic model running **will not work**; use the manual forward-hook pattern above instead.

**SAELens version requirements.**

- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

**These SAEs detect deceptive *behavior*, not deceptive *prompts*.** They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
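
The Known Limitations section notes that a JumpReLU with `threshold = 0` is functionally a ReLU, while TopK is unaffected. The sketch below illustrates why, using the textbook definitions of the three activations on a toy pre-activation vector. It is an assumption that these definitions match this repo's training code exactly (in particular, clipping negatives inside TopK is a choice made here), and the 0.6 threshold is hypothetical.

```python
from typing import List


def relu(pre: List[float]) -> List[float]:
    """Zero out negative pre-activations."""
    return [max(0.0, p) for p in pre]


def jumprelu(pre: List[float], threshold: List[float]) -> List[float]:
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, threshold)]


def topk(pre: List[float], k: int) -> List[float]:
    """Keep the k largest pre-activations (clipped at zero); zero the rest."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:k])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]


def l0(acts: List[float]) -> int:
    """Number of nonzero activations."""
    return sum(1 for a in acts if a != 0.0)


pre = [0.5, -1.0, 2.0, 0.1, -0.3, 1.2]

zero_thr = jumprelu(pre, [0.0] * len(pre))     # threshold = 0, as in this repo
learned_thr = jumprelu(pre, [0.6] * len(pre))  # hypothetical nonzero threshold
top2 = topk(pre, k=2)

print(zero_thr == relu(pre))                  # True
print(l0(zero_thr), l0(learned_thr), l0(top2))  # 4 2 2
```

With a zero threshold every positive pre-activation survives, so sparsity comes only from the sign of the pre-activation; that is consistent with the high L0 reported for the JumpReLU SAEs. TopK fixes L0 at k regardless of any threshold.
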

## Citation

```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
```