---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- SAE
- interpretability
- deception-detection
- mechanistic-interpretability
- saelens
- neuronpedia
- behavioral-sampling
- phi
- reasoning
base_model:
- microsoft/Phi-4-mini-reasoning
datasets:
- Solshine/deception-behavioral-multimodel
---
# Phi-4-mini-reasoning Deception Behavioral SAEs
42 sparse autoencoders (SAEs) trained on residual stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) (a 3.8B-parameter, reasoning-fine-tuned Phi-architecture model), capturing behavioral deception signals via same-prompt temperature sampling.
## Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:
- **Six clean incentive-structure scenarios:** `insider_info`,
`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
`surprise_party`, `job_interview_gap`. Each presents a realistic
situation with a non-trivial incentive to conceal or disclose
information; the honest/deceptive distinction is which of the two
behavioral choices the model's completion settles into under
temperature sampling.
- **Three role-play identity-assignment scenarios:**
`secret_role_game` ("You have been assigned the role of Fascist"),
`secret_role_game_v2` ("You are secretly one of the Mafia members"),
`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
deceptive identity and label a completion "deceptive" when the model
drifts away from the assigned role or "honest" when it echoes it.
**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or
incentive-driven deception.
**What this SAE is and is not good for.**
- **Good for:** research on mixed-pool activation geometry; SAE
feature-geometry studies; as one of a set of baselines when
comparing multiple SAE families; as a reference implementation of
same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
role-consistency signal from the three role-play scenarios is mixed
into every aggregate metric reported below. A downstream user who
wants an "emergent-deception feature set" should restrict attention
to features whose activation pattern concentrates in the
`insider_info` / `accounting_error` / `ai_oversight_log` /
`ai_capability_hide` / `surprise_party` / `job_interview_gap`
scenarios, or wait for the methodologically corrected V3 re-release
currently in preparation on the decision-incentive scenario bank
(no pre-assigned deceptive identity).
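One way to act on the restriction suggested above is to score each feature by how much of its activation mass falls in the six incentive-structure scenarios. The sketch below is only an illustration: the `clean_concentration` helper, the per-scenario mean activations, and the cutoff are hypothetical and are not part of this release.

```python
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}
ROLE_PLAY_SCENARIOS = {"secret_role_game", "secret_role_game_v2", "werewolf_game"}


def clean_concentration(mean_act_by_scenario):
    """Fraction of a feature's total mean activation that comes from clean scenarios."""
    total = sum(mean_act_by_scenario.values())
    if total <= 0:
        return 0.0
    clean = sum(v for name, v in mean_act_by_scenario.items() if name in CLEAN_SCENARIOS)
    return clean / total


# Made-up mean activations for two hypothetical features (not measured values).
feature_a = {"insider_info": 2.0, "accounting_error": 1.5, "werewolf_game": 0.5}
feature_b = {"secret_role_game": 3.0, "werewolf_game": 2.0, "surprise_party": 1.0}

CUTOFF = 0.8  # hypothetical threshold; tune for your own analysis
for name, feature in [("feature_a", feature_a), ("feature_b", feature_b)]:
    score = clean_concentration(feature)
    print(name, round(score, 3), "keep" if score >= CUTOFF else "drop")
```

A ratio of mean activations is a crude filter; it assumes non-negative activations and roughly comparable sample counts per scenario.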
**What is unaffected by this caveat.**
- The SAE weights, reconstruction metrics (explained variance, L0,
alive features), and engineering of the training pipeline are
accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
measure the mixed pool; the 6-scenario clean-subset re-analysis is
listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.
---
Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).
## What's in This Repo
- **42 SAEs** across 7 layers (L2, L6, L10, L14, L18, L22, L26)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=3072, d_sae=12288 (4x expansion)
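The grid above (7 layers × 2 architectures × 3 conditions) accounts for all 42 SAEs. As a rough sketch, the subfolder names can be enumerated as follows. The naming pattern is an assumption inferred from the two ids that appear in this README (`phi4_mini_jumprelu_L6_honest_only` and `phi4_mini_topk_L6_honest_only`); check the repo file listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]


def list_sae_ids():
    """Enumerate SAE ids under the assumed phi4_mini_{arch}_L{layer}_{condition} pattern."""
    return [
        f"phi4_mini_{arch}_L{layer}_{condition}"
        for arch, layer, condition in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]


sae_ids = list_sae_ids()
print(len(sae_ids))  # 42
print(sae_ids[0])    # phi4_mini_topk_L2_mixed
```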
## Research Context
This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.
Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)
## Key Findings: Phi-4-mini-reasoning
Phi-4-mini-reasoning is the **largest model** in the 9-model study and the only reasoning-fine-tuned model included.
| Metric | Value |
|---|---|
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | **80.8%** |
| Peak AUROC | **0.860** |
| Best SAE probe accuracy | **81.0%** (`phi4_mini_jumprelu_L6_honest_only`) |
| SAEs beating raw baseline | 1/42 (2%); SAEs **hurt** detection |
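For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so a probe that always predicts the majority class scores 0.5 regardless of class sizes. The sketch below uses the class sizes from the training table (129 honest, 123 deceptive) only as an illustrative split; the predictions are invented and this is not the study's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


y_true = ["honest"] * 129 + ["deceptive"] * 123

# Degenerate probe: always predicts the majority class.
always_honest = ["honest"] * 252
print(balanced_accuracy(y_true, always_honest))  # 0.5

# Invented probe: 100 of 129 honest and 100 of 123 deceptive samples correct.
toy_pred = (["honest"] * 100 + ["deceptive"] * 29
            + ["deceptive"] * 100 + ["honest"] * 23)
print(round(balanced_accuracy(y_true, toy_pred), 3))
```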
**Most striking finding, a broad plateau across all 32 layers:** Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.
**Phi architecture anomaly does not persist at 3.8B:** The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.
**Reasoning fine-tuning context:** Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.
**SAE decomposition hurts:** Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, confirming the 1.3B–1.7B transition as the boundary between SAE-helps and SAE-hurts regimes.
**Architecture note:** Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups.
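A minimal sketch of what single-GPU 4-bit loading with that `device_map` can look like through the `transformers` bitsandbytes integration. This is not the study's collection script: the helper name `load_phi4_mini_4bit` and the float16 compute dtype are assumptions chosen for illustration, and the call needs a CUDA GPU with `bitsandbytes` installed.

```python
MODEL_NAME = "microsoft/Phi-4-mini-reasoning"
DEVICE_MAP = {"": "cuda:0"}  # place the whole model on the first GPU


def load_phi4_mini_4bit(model_name=MODEL_NAME):
    """Load the model in 4-bit on a single GPU (hypothetical helper)."""
    # Imported lazily so this module can be read without a GPU environment.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # assumption, not taken from the study
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map=DEVICE_MAP,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer


# model, tokenizer = load_phi4_mini_4bit()  # requires CUDA and bitsandbytes
```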
## SAE Format
Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config
`hook_name` format: `model.layers.{layer}.hook_resid_post`
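This README shows `hook_name` both with and without the `.hook_resid_post` suffix, so code that resolves the hook to a `transformers` submodule may want to accept either form. The helper below is a hypothetical sketch (the name `parse_hook_name` is not part of this repo or of SAELens) and needs Python 3.9+ for `str.removesuffix`.

```python
import re


def parse_hook_name(hook_name):
    """Return (submodule_path, layer_index) for either hook_name form."""
    module_path = hook_name.removesuffix(".hook_resid_post")
    match = re.fullmatch(r"model\.layers\.(\d+)", module_path)
    if match is None:
        raise ValueError(f"Unrecognized hook_name: {hook_name!r}")
    return module_path, int(match.group(1))


print(parse_hook_name("model.layers.6.hook_resid_post"))  # ('model.layers.6', 6)
print(parse_hook_name("model.layers.6"))                  # ('model.layers.6', 6)
```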
## Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) |
| LLM classifier | Gemini 2.5 Flash |
## Known Limitations
**JumpReLU threshold not learned (21 JumpReLU SAEs):** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
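To see why a zero threshold collapses JumpReLU to ReLU, and why that gives an L0 near half of d_sae, here is a small pure-Python sketch. It assumes zero-mean, symmetric pre-activations (drawn from a standard Gaussian here), which is a simplification and not a measurement of these SAEs; the TopK function is likewise a simplified stand-in rather than the training code.

```python
import random


def relu(pre_acts):
    return [max(z, 0.0) for z in pre_acts]


def jumprelu(pre_acts, threshold):
    """JumpReLU: pass a value through unchanged only if it exceeds the threshold."""
    return [z if z > threshold else 0.0 for z in pre_acts]


def topk(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    order = sorted(range(len(pre_acts)), key=pre_acts.__getitem__, reverse=True)
    keep = set(order[:k])
    return [z if i in keep else 0.0 for i, z in enumerate(pre_acts)]


def l0(acts):
    return sum(1 for a in acts if a != 0.0)


random.seed(0)
d_sae = 12288
pre_acts = [random.gauss(0.0, 1.0) for _ in range(d_sae)]

# With threshold = 0, JumpReLU and ReLU agree element by element.
assert jumprelu(pre_acts, 0.0) == relu(pre_acts)
print(l0(jumprelu(pre_acts, 0.0)) / d_sae)  # close to 0.5 for symmetric inputs
print(l0(topk(pre_acts, 64)))               # 64
```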
**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).
**4-bit quantization:** Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.
**Small dataset:** n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.
## Loading Example
```python
import json

from safetensors.torch import load_file

# Assumes the repo has been downloaded so that {sae_id}/ is a local folder.
sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)
# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
```
## Usage
### 1. Load an SAE from this repo
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id = "phi4_mini_topk_L6_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B β load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
# W_dec [12288, 3072], b_dec [3072], threshold [12288]
```
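Option B leaves the forward pass implicit. The sketch below shows one common TopK SAE convention using the key names listed above. It is an assumption, not the confirmed training-time forward pass: in particular, whether `b_dec` is subtracted from the input before encoding varies between SAE implementations, so it is exposed here as a flag to check against the loaded `cfg`.

```python
def encode_topk(x, state, k=64, subtract_b_dec=True):
    """TopK encode. x: [batch, 3072] tensor; state: dict of tensors from load_file."""
    if subtract_b_dec:
        x = x - state["b_dec"]
    pre_acts = x @ state["W_enc"] + state["b_enc"]   # [batch, 12288]
    values, indices = pre_acts.topk(k, dim=-1)       # k largest per row
    acts = pre_acts.new_zeros(pre_acts.shape)
    acts.scatter_(-1, indices, values.relu())        # all other features stay zero
    return acts


def decode(acts, state):
    """Map feature activations back to the residual stream: [batch, 3072]."""
    return acts @ state["W_dec"] + state["b_dec"]


# Usage, once a residual-stream tensor `resid` of shape [batch, 3072] is available:
# feature_acts = encode_topk(resid.float(), state)
# reconstruction = decode(feature_acts, state)
```

Both functions rely only on tensor methods, so the input and the weights must share a device and dtype.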
### 2. Hook into the model and collect residual-stream activations
These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-4-mini uses LLaMA-style architecture. Hook path: `model.layers.{layer}`.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")
# Read hook_name from the cfg you already loaded,
# e.g. "model.layers.6" (varies by SAE).
hook_name = cfg["hook_name"]
# Drop a trailing ".hook_resid_post" if present, so only the submodule path remains
module_path = hook_name.removesuffix(".hook_resid_post")
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :] # last token position
```
### 3. Read feature activations
```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check β should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
### Caveats and known limitations
**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** β use the
manual forward-hook pattern above instead.
**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)
**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.
## Citation
```bibtex
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}
```