---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- SAE
- interpretability
- deception-detection
- mechanistic-interpretability
- saelens
- neuronpedia
- behavioral-sampling
- phi
base_model:
- microsoft/phi-2
datasets:
- Solshine/deception-behavioral-multimodel
---
# Phi-2 Deception Behavioral SAEs
30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.
## Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:
- **Six clean incentive-structure scenarios:** `insider_info`,
`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
`surprise_party`, `job_interview_gap`. Each presents a realistic
situation with a non-trivial incentive to conceal or disclose
information; the honest/deceptive distinction is which of the two
behavioral choices the model's completion settles into under
temperature sampling.
- **Three role-play identity-assignment scenarios:**
`secret_role_game` ("You have been assigned the role of Fascist"),
`secret_role_game_v2` ("You are secretly one of the Mafia members"),
`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
deceptive identity and label a completion "deceptive" when the model
drifts away from the assigned role or "honest" when it echoes it.
**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or incentive-
driven deception.
**What this SAE is and is not good for.**
- **Good for:** research on mixed-pool activation geometry; SAE
feature-geometry studies; as one of a set of baselines when
comparing multiple SAE families; as a reference implementation of
same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
role-consistency signal from the three role-play scenarios is mixed
into every aggregate metric reported below. A downstream user who
wants an "emergent-deception feature set" should restrict attention
to features whose activation pattern concentrates in the
`insider_info` / `accounting_error` / `ai_oversight_log` /
`ai_capability_hide` / `surprise_party` / `job_interview_gap`
scenarios, or wait for the methodologically corrected V3 re-release
currently in preparation on the decision-incentive scenario bank
(no pre-assigned deceptive identity).
**What is unaffected by this caveat.**
- The SAE weights, reconstruction metrics (explained variance, L0,
alive features), and engineering of the training pipeline are
accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
measure the mixed pool; the 6-scenario clean-subset re-analysis is
listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.
---
Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).
## What's in This Repo
- **30 SAEs** across 5 layers (L4, L8, L12, L16, L20)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=2560, d_sae=10240 (4x expansion)
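The count of 30 follows from 5 layers × 2 architectures × 3 training conditions. A minimal sketch that enumerates the expected subfolder names, assuming every SAE follows the `phi2_{arch}_L{layer}_{condition}` pattern seen in the two IDs used in this README (`phi2_jumprelu_L20_honest_only` and `phi2_topk_L20_honest_only`). The pattern is inferred, not confirmed, so check it against the repo's file listing:

```python
from itertools import product

LAYERS = [4, 8, 12, 16, 20]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]


def expected_sae_ids():
    """Enumerate SAE subfolder names under the ASSUMED naming pattern."""
    return [
        f"phi2_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]


sae_ids = expected_sae_ids()
print(len(sae_ids), "expected SAE ids, e.g.", sae_ids[0])
```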
## Research Context
This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). The study uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).
Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)
## Key Findings: Phi-2
Phi-2 is the **anomalous model** in the 9-model study: it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.
| Metric | Value |
|---|---|
| Peak layer | L21 (75% depth; note: not a trained SAE layer) |
| Best SAE layer | L20 |
| Peak balanced accuracy | **74.9%** |
| Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) |
| SAEs beating raw baseline | 10/30 (33%), partial SAE **help** |
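The table reports a peak balanced accuracy. A minimal sketch of why that metric matters with unequal class sizes, assuming the standard definition (mean of per-class recall); the upstream evaluation protocol is not reproduced here, and the 88/158 split is simply borrowed from the sample counts in the Training Details table:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# 1 = deceptive (n=88), 0 = honest (n=158)
y_true = [1] * 88 + [0] * 158
always_honest = [0] * len(y_true)  # trivial majority-class predictor

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_honest)
print(f"plain accuracy:    {plain_acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

A predictor that always answers "honest" scores well above chance on plain accuracy but exactly at chance on balanced accuracy, which is why the balanced figure is the more informative baseline for an imbalanced pool.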
**The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline, a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.
**The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help, which is standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.
**Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs. the L20 raw baseline of 74.9%. Unlike d20/TinyLlama/Pythia160m, where the SAE advantage appears at a specific training condition, for Phi-2 the honest_only condition wins at the best layer, consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.
**Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (77.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.
**Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.
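The difference between the two block orderings can be sketched with toy stand-ins. This is an illustration only: `toy_attn` and `toy_mlp` are hypothetical placeholder functions, layer norms are omitted, and nothing here is Phi-2's real computation. The point is just which input each sub-layer reads.

```python
def add(*vectors):
    """Elementwise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def toy_attn(x):
    # Hypothetical stand-in for attention: broadcasts half the vector mean.
    mean = sum(x) / len(x)
    return [0.5 * mean for _ in x]


def toy_mlp(x):
    # Hypothetical stand-in for the MLP: scaled elementwise ReLU.
    return [0.1 * max(v, 0.0) for v in x]


def sequential_block(x):
    """Standard ordering: the MLP reads the attention-updated stream."""
    h = add(x, toy_attn(x))
    return add(h, toy_mlp(h))


def parallel_block(x):
    """Parallel ordering: attention and MLP read the same input; outputs are summed."""
    return add(x, toy_attn(x), toy_mlp(x))


x = [1.0, -2.0, 4.0]
seq_out = sequential_block(x)
par_out = parallel_block(x)
print("sequential:", seq_out)
print("parallel:  ", par_out)
```

In the sequential form the MLP output depends on what attention wrote; in the parallel form the two contributions are independent functions of the same residual input.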
## SAE Format
Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config
`hook_name` format: `model.layers.{layer}.hook_resid_post`
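A small sketch for recovering the layer index from a `hook_name` string. Because examples elsewhere in this README also show the shorter form `model.layers.20`, the hypothetical helper below accepts both spellings; treat the accepted patterns as an assumption and adjust them to whatever your `cfg.json` actually contains.

```python
import re

_HOOK_RE = re.compile(r"model\.layers\.(\d+)(?:\.hook_resid_post)?")


def layer_from_hook_name(hook_name):
    """Return the integer layer index from a `model.layers.{N}[.hook_resid_post]` name."""
    match = _HOOK_RE.fullmatch(hook_name)
    if match is None:
        raise ValueError(f"Unrecognized hook_name: {hook_name!r}")
    return int(match.group(1))


print(layer_from_hook_name("model.layers.20.hook_resid_post"))
print(layer_from_hook_name("model.layers.4"))
```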
## Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2560 → 10240) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
| LLM classifier | Gemini 2.5 Flash |
## Known Limitations
**JumpReLU threshold not learned (15 of 30 SAEs):** All JumpReLU SAEs have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
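To make the "functionally ReLU" point concrete, here is a minimal sketch. It assumes zero-centered Gaussian pre-activations purely for illustration (this is not a measurement of these SAEs), under which about half the features are active, in line with the reported L0.

```python
import random


def jumprelu(z, threshold):
    """JumpReLU: pass z through unchanged where z > threshold, else output 0."""
    return [v if v > threshold else 0.0 for v in z]


def relu(z):
    return [max(v, 0.0) for v in z]


random.seed(0)
# Illustrative pre-activations for d_sae = 10240 features (assumed distribution).
pre_acts = [random.gauss(0.0, 1.0) for _ in range(10_240)]

# With threshold = 0 the two activations coincide exactly.
assert jumprelu(pre_acts, 0.0) == relu(pre_acts)

l0_fraction = sum(v > 0 for v in jumprelu(pre_acts, 0.0)) / len(pre_acts)
print(f"fraction of active features: {l0_fraction:.2f}")
```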
**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).
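For context, the true derivative of a JumpReLU output with respect to its threshold is zero almost everywhere, so the threshold receives no gradient signal without an estimator. A straight-through estimator substitutes a kernel-smoothed pseudo-derivative. The sketch below writes out the pseudo-derivatives described by Rajamanoharan et al. 2024 with a Gaussian density as the kernel; the bandwidth value and function names are hypothetical, and this is not the repo's training code.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the STE kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def heaviside_pseudo_grad(z, theta, bandwidth):
    """Pseudo-derivative of H(z - theta) w.r.t. theta: -(1/eps) * K((z - theta) / eps)."""
    return -gaussian_kernel((z - theta) / bandwidth) / bandwidth


def jumprelu_pseudo_grad(z, theta, bandwidth):
    """Pseudo-derivative of JumpReLU_theta(z) w.r.t. theta: -(theta/eps) * K((z - theta) / eps)."""
    return -theta * gaussian_kernel((z - theta) / bandwidth) / bandwidth


eps = 0.1  # hypothetical bandwidth
for z in (0.5, 0.95, 1.0, 1.05, 2.0):
    print(z, heaviside_pseudo_grad(z, theta=1.0, bandwidth=eps))
```

The pseudo-gradient is concentrated on pre-activations within roughly one bandwidth of the threshold and vanishes far from it.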
**4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.
**Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.
## Loading Example
```python
from safetensors.torch import load_file
import json
sae_id = "phi2_jumprelu_L20_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:  # close the file handle after reading
    cfg = json.load(f)
# W_enc: [2560, 10240], W_dec: [10240, 2560]
# cfg["hook_name"] == "model.layers.20.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")
```
## Usage
### 1. Load an SAE from this repo
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-phi-2"
sae_id = "phi2_topk_L20_honest_only"  # replace with any SAE subfolder name in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2560, 10240], b_enc [10240],
# W_dec [10240, 2560], b_dec [2560], threshold [10240]
```
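If you load the weights manually (Option B), the forward pass has to be written by hand. The following is a minimal sketch, not SAELens' implementation: the function names are hypothetical, and whether the input should be centered by `b_dec` before encoding is an assumption that depends on the SAE's config, so it is exposed as a flag that defaults to off. The demo uses toy shapes and random weights in place of the real `state` dict.

```python
import torch


def pre_activations(x, state, center_input=False):
    """Encoder pre-activations; `center_input` subtracts b_dec first (config-dependent)."""
    if center_input:
        x = x - state["b_dec"]
    return x @ state["W_enc"] + state["b_enc"]


def topk_encode(x, state, k=64, center_input=False):
    """Keep the k largest ReLU pre-activations per row and zero the rest."""
    pre = torch.relu(pre_activations(x, state, center_input))
    top = pre.topk(k, dim=-1)
    return torch.zeros_like(pre).scatter_(-1, top.indices, top.values)


def jumprelu_encode(x, state, center_input=False):
    """Pass pre-activations through only where they exceed the stored threshold."""
    pre = pre_activations(x, state, center_input)
    return torch.where(pre > state["threshold"], pre, torch.zeros_like(pre))


def decode(acts, state):
    return acts @ state["W_dec"] + state["b_dec"]


# Toy shapes for illustration; the real SAEs use d_in=2560, d_sae=10240, k=64.
torch.manual_seed(0)
d_in, d_sae = 8, 32
state = {
    "W_enc": torch.randn(d_in, d_sae), "b_enc": torch.zeros(d_sae),
    "W_dec": torch.randn(d_sae, d_in), "b_dec": torch.zeros(d_in),
    "threshold": torch.zeros(d_sae),
}
x = torch.randn(2, d_in)
acts = topk_encode(x, state, k=4)
recon = decode(acts, state)
print(acts.shape, recon.shape)
```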
### 2. Hook into the model and collect residual-stream activations
These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-2 uses an unusual parallel attention+MLP architecture. Hook path: `model.layers.{layer}`. Note: Phi-2 SAE probe results are anomalous; see the Key Findings section above for details.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "model.layers.20" (example; varies by SAE)
hook_name = cfg["hook_name"]  # e.g. "model.layers.20"
# If the stored name ends in ".hook_resid_post" (the format listed under
# "SAE Format"), strip it so the path resolves to a real submodule.
module_path = hook_name.removesuffix(".hook_resid_post")  # Python 3.9+
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 2560]
resid = activations["resid"][:, -1, :] # last token position
```
### 3. Read feature activations
```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 10240], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
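The L2 error above depends on the scale of the activations. A scale-free alternative is explained variance, one of the reconstruction metrics mentioned in the caveat section. A minimal sketch under one common definition (1 minus residual sum of squares over total sum of squares around the batch mean); the definition used for the reported numbers is not stated here, so treat this as an assumption. It needs more than one row, so collect activations for several tokens or prompts first. The demo uses random tensors in place of `resid` and `reconstruction`.

```python
import torch


def explained_variance(x, x_hat):
    """1 - (sum of squared residuals) / (sum of squared deviations from the batch mean)."""
    resid_ss = (x - x_hat).pow(2).sum()
    total_ss = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (resid_ss / total_ss).item()


# Toy stand-ins: 16 activation vectors and a slightly noisy "reconstruction".
torch.manual_seed(0)
x = torch.randn(16, 8)
x_hat = x + 0.1 * torch.randn_like(x)
ev = explained_variance(x, x_hat)
print(f"explained variance: {ev:.3f}")
```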
### Caveats and known limitations
**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** β use the
manual forward-hook pattern above instead.
**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)
**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.
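One simple way to look at such activation differences is to rank features by the gap in mean activation between completions labeled deceptive and honest for the same prompt. This is an exploratory sketch, not the probing method used in the paper; the function name is hypothetical and the demo uses synthetic activations. Given the training-data caveat above, it is worth checking which scenarios drive any feature it surfaces.

```python
import torch


def top_differential_features(honest_acts, deceptive_acts, k=10):
    """Rank features by |mean(deceptive) - mean(honest)|; returns (index, signed diff) pairs."""
    diff = deceptive_acts.mean(dim=0) - honest_acts.mean(dim=0)
    top_idx = diff.abs().topk(k).indices.tolist()
    return [(i, diff[i].item()) for i in top_idx]


# Synthetic SAE feature activations: [n_completions, n_features].
honest = torch.zeros(5, 6)
deceptive = torch.zeros(5, 6)
deceptive[:, 3] = 2.0  # feature 3 fires only on "deceptive" rows
honest[:, 1] = 0.5     # feature 1 fires only on "honest" rows

ranked = top_differential_features(honest, deceptive, k=2)
print(ranked)
```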
## Citation
```bibtex
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}
```