---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- SAE
- interpretability
- deception-detection
- mechanistic-interpretability
- saelens
- neuronpedia
- behavioral-sampling
- phi
- reasoning
base_model:
- microsoft/Phi-4-mini-reasoning
datasets:
- Solshine/deception-behavioral-multimodel
---
# Phi-4-mini-reasoning Deception Behavioral SAEs
42 sparse autoencoders (SAEs) trained on residual stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) (a 3.8B-parameter, reasoning-fine-tuned Phi-architecture model), capturing behavioral deception signals via same-prompt temperature sampling.
## Training-data caveat — please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:
- **Six clean incentive-structure scenarios** — `insider_info`,
`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
`surprise_party`, `job_interview_gap`. Each presents a realistic
situation with a non-trivial incentive to conceal or disclose
information; the honest/deceptive distinction is which of the two
behavioral choices the model's completion settles into under
temperature sampling.
- **Three role-play identity-assignment scenarios** —
`secret_role_game` ("You have been assigned the role of Fascist"),
`secret_role_game_v2` ("You are secretly one of the Mafia members"),
`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
deceptive identity and label a completion "deceptive" when the model
drifts away from the assigned role or "honest" when it echoes it.
**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or
incentive-driven deception.
**What this SAE is and is not good for.**
- **Good for:** research on mixed-pool activation geometry; SAE
feature-geometry studies; as one of a set of baselines when
comparing multiple SAE families; as a reference implementation of
same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
role-consistency signal from the three role-play scenarios is mixed
into every aggregate metric reported below. A downstream user who
wants an "emergent-deception feature set" should restrict attention
to features whose activation pattern concentrates in the
`insider_info` / `accounting_error` / `ai_oversight_log` /
`ai_capability_hide` / `surprise_party` / `job_interview_gap`
scenarios — or wait for the methodologically corrected V3 re-release
currently in preparation on the decision-incentive scenario bank
(no pre-assigned deceptive identity).
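
One way to apply the clean-scenario restriction described above is to measure, per feature, how much of its total activation mass falls on samples from the six incentive-structure scenarios. The following is a minimal sketch, not part of the released pipeline: it assumes you have already computed a `[n_samples, d_sae]` feature-activation matrix and have one scenario name per sample. The function name `clean_scenario_fraction`, the toy values, and the 0.9 cutoff are all illustrative assumptions.

```python
import numpy as np

# Scenario names as listed in this model card.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_scenario_fraction(feature_acts, scenarios):
    """Per-feature share of activation mass on clean-scenario samples.

    feature_acts: [n_samples, d_sae] non-negative SAE activations.
    scenarios:    length-n_samples sequence of scenario names.
    Returns a [d_sae] array; features that never fire get NaN.
    """
    acts = np.asarray(feature_acts, dtype=np.float64)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    # 0/0 for dead features is silenced here and mapped to NaN below.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total_mass > 0, clean_mass / total_mass, np.nan)

# Toy example: 4 samples, 3 features (hypothetical values).
toy_acts = np.array([
    [1.0, 0.0, 0.0],   # insider_info
    [3.0, 0.0, 0.0],   # surprise_party
    [0.0, 2.0, 0.0],   # werewolf_game
    [0.0, 2.0, 0.0],   # secret_role_game
])
toy_scenarios = ["insider_info", "surprise_party", "werewolf_game", "secret_role_game"]

frac = clean_scenario_fraction(toy_acts, toy_scenarios)
clean_features = np.flatnonzero(frac >= 0.9)  # 0.9 is an arbitrary cutoff
print(frac, clean_features)
```

A mass-based fraction is only one possible concentration measure; a firing-frequency ratio per scenario would be an equally reasonable choice.
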
**What is unaffected by this caveat.**
- The SAE weights, reconstruction metrics (explained variance, L0,
alive features), and engineering of the training pipeline are
accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
measure the mixed pool; the 6-scenario clean-subset re-analysis is
listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.
---
Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).
## What's in This Repo
- **42 SAEs** across 7 layers (L2, L6, L10, L14, L18, L22, L26)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=3072, d_sae=12288 (4x expansion)
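
The layer, architecture, and condition grid above can be expanded into a list of SAE ids. This sketch assumes every subfolder follows the pattern `phi4_mini_{arch}_L{layer}_{condition}`, which is inferred from the two ids quoted elsewhere in this card; check the actual folder names in the repo before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, inferred from ids such as
# "phi4_mini_jumprelu_L6_honest_only"; verify against the repo's folders.
sae_ids = [
    f"phi4_mini_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]
print(len(sae_ids))  # 7 layers x 2 architectures x 3 conditions = 42
```
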
## Research Context
This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). The method is same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is then classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.
Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)
## Key Findings — Phi-4-mini-reasoning
Phi-4-mini-reasoning is the **largest model** in the 9-model study and the only reasoning-fine-tuned model included.
| Metric | Value |
|---|---|
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | **80.8%** |
| Peak AUROC | **0.860** |
| Best SAE probe accuracy | **81.0%** (`phi4_mini_jumprelu_L6_honest_only`) |
| SAEs beating raw baseline | 1/42 (2%) — SAEs **hurt** detection |
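
For readers unfamiliar with the metric in the table above, balanced accuracy is conventionally the mean of per-class recall, which keeps a slightly imbalanced honest/deceptive split from inflating the score. The sketch below assumes that standard definition and a hypothetical label encoding (1 = deceptive, 0 = honest); it is not taken from the study's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = deceptive, 0 = honest)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if not idx:
            raise ValueError(f"no samples of class {cls}")
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels (hypothetical): 4 honest, 2 deceptive.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
# Honest recall = 3/4, deceptive recall = 1/2, so the mean is 0.625.
print(balanced_accuracy(y_true, y_pred))
```
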
**Most striking finding — broad plateau across all 32 layers:** Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.
**Phi architecture anomaly does not persist at 3.8B:** The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.
**Reasoning fine-tuning context:** Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.
**SAE decomposition hurts:** Only 1/42 SAEs marginally beats the raw baseline, and by only +0.2pp. The large model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp — confirming the 1.3B–1.7B transition as the boundary between SAE-helps and SAE-hurts regimes.
**Architecture note:** Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups.
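
A minimal sketch of a 4-bit load using the `device_map` kwarg mentioned above. It assumes `bitsandbytes` and a CUDA GPU are available; only `load_in_4bit=True` is set, and all other quantization options are left at library defaults, which may differ from the settings used in the study. The helper name `load_phi4_mini_4bit` is hypothetical.

```python
def load_phi4_mini_4bit(model_id="microsoft/Phi-4-mini-reasoning"):
    """Load the model in 4-bit on a single GPU (needs bitsandbytes + CUDA)."""
    # Imported lazily so this file can be imported without the GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumption: default 4-bit settings; the study's exact options are not stated.
    quant_cfg = BitsAndBytesConfig(load_in_4bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_cfg,
        device_map={"": "cuda:0"},  # place every module on the single GPU
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Call `model, tokenizer = load_phi4_mini_4bit()` on a machine with a CUDA device; the function is not executed at import time.
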
## SAE Format
Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors` — encoder/decoder weights
- `cfg.json` — SAELens-compatible config
`hook_name` format: `model.layers.{layer}.hook_resid_post`
## Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) |
| LLM classifier | Gemini 2.5 Flash |
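
The three training conditions in the table above can be read as label-based subsets of one sample pool. The sketch below illustrates that reading under the assumption that `mixed` is the union of the two labeled subsets (consistent with 123 + 129 = 252); the function name and the `(activation, label)` pair layout are hypothetical, not the repo's actual data format.

```python
def split_conditions(samples):
    """Group labeled samples into the three SAE training conditions.

    samples: iterable of (activation, label) pairs,
             with label in {"honest", "deceptive"}.
    """
    samples = list(samples)
    return {
        "mixed": samples,
        "deceptive_only": [s for s in samples if s[1] == "deceptive"],
        "honest_only": [s for s in samples if s[1] == "honest"],
    }

# Placeholder activations (None) with the label counts from the table above.
toy = [(None, "deceptive")] * 123 + [(None, "honest")] * 129
conditions = split_conditions(toy)
print({name: len(subset) for name, subset in conditions.items()})
```
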
## Known Limitations
**JumpReLU threshold not learned (21 of 42 SAEs):** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
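
To see why a zero threshold collapses JumpReLU to ReLU, and why roughly half the features end up active, consider the toy sketch below. It uses random zero-centered values as a stand-in for encoder pre-activations, which is an assumption made for illustration only; real pre-activations need not be symmetric.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """JumpReLU: pass a pre-activation through only where it exceeds its threshold."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

rng = np.random.default_rng(0)
pre_acts = rng.standard_normal((256, 1024))  # stand-in for encoder pre-activations
zero_threshold = np.zeros(1024)              # the un-learned threshold

acts = jumprelu(pre_acts, zero_threshold)
relu = np.maximum(pre_acts, 0.0)
l0_fraction = (acts > 0).mean()              # share of features active per input

# With threshold 0 the two activations coincide, and about half the
# zero-centered pre-activations are positive.
print(np.array_equal(acts, relu), round(float(l0_fraction), 2))
```
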
**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).
**4-bit quantization:** Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.
**Small dataset:** n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.
## Loading Example
```python
from safetensors.torch import load_file
import json
sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:  # close the file handle after reading
    cfg = json.load(f)
# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
```
## Usage
### 1. Load an SAE from this repo
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id = "phi4_mini_topk_L6_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B — load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
# W_dec [12288, 3072], b_dec [3072], threshold [12288]
```
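
If you take the manual route (Option B), the forward pass has to be written by hand. The sketch below shows one plausible encoder/decoder for the listed keys. Two points are assumptions rather than facts about these checkpoints: that `b_dec` is subtracted from the input before encoding, and that the TopK variant applies a ReLU to the selected values. Compare against a SAELens-loaded SAE before trusting the outputs. The toy state dict uses small random tensors in place of the real weights.

```python
import torch

def encode(x, state, architecture, k=64):
    """Manual SAE encoder. x: [batch, d_in] -> [batch, d_sae]."""
    # Assumption: the decoder bias is subtracted from the input first.
    pre = (x - state["b_dec"]) @ state["W_enc"] + state["b_enc"]
    if architecture == "topk":
        # Keep the k largest pre-activations per row and zero the rest.
        top = pre.topk(k, dim=-1)
        acts = torch.zeros_like(pre)
        return acts.scatter_(-1, top.indices, torch.relu(top.values))
    if architecture == "jumprelu":
        # Pass values through only where they exceed the per-feature threshold.
        return pre * (pre > state["threshold"])
    raise ValueError(f"unknown architecture: {architecture}")

def decode(acts, state):
    """Manual SAE decoder. acts: [batch, d_sae] -> [batch, d_in]."""
    return acts @ state["W_dec"] + state["b_dec"]

# Tiny random stand-in for a real state dict (real shapes: d_in=3072, d_sae=12288).
d_in, d_sae = 8, 32
toy_state = {
    "W_enc": torch.randn(d_in, d_sae), "b_enc": torch.zeros(d_sae),
    "W_dec": torch.randn(d_sae, d_in), "b_dec": torch.zeros(d_in),
    "threshold": torch.zeros(d_sae),
}
x = torch.randn(4, d_in)
topk_acts = encode(x, toy_state, "topk", k=4)
jump_acts = encode(x, toy_state, "jumprelu")
recon = decode(topk_acts, toy_state)
print(topk_acts.shape, jump_acts.shape, recon.shape)
```
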
### 2. Hook into the model and collect residual-stream activations
These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-4-mini uses LLaMA-style architecture. Hook path: `model.layers.{layer}`.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")
# Read hook_name from the cfg you already loaded, e.g. "model.layers.6"
# (varies by SAE). If your cfg.json appends ".hook_resid_post", strip it:
# only the "model.layers.{layer}" prefix is a real submodule path.
hook_name = cfg["hook_name"].removesuffix(".hook_resid_post")  # Python ≥ 3.9
# Resolve the dotted path to the decoder layer to hook
submodule = model.get_submodule(hook_name)
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :] # last token position
```
### 3. Read feature activations
```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check — should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
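
The L2 error above is hard to interpret on its own; a scale-free alternative is explained variance. The sketch below assumes the common pooled definition (one minus residual sum of squares over total sum of squares around the per-dimension mean), which may not match how the reported reconstruction metrics were computed. It needs several activation vectors, since the variance of a single sample is zero.

```python
import torch

def explained_variance(x, x_hat):
    """1 - residual variance / total variance, pooled over batch and dimensions.

    x, x_hat: [n, d] original and reconstructed activations, with n > 1.
    """
    residual = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (residual / total).item()

# Toy check with random stand-in activations (hypothetical data).
toy_x = torch.randn(16, 8)
perfect = explained_variance(toy_x, toy_x)
mean_only = explained_variance(toy_x, toy_x.mean(dim=0, keepdim=True).expand_as(toy_x))
print(perfect, mean_only)
```

A perfect reconstruction scores 1, and predicting the per-dimension mean scores 0, so values in between indicate how much of the activation variance the SAE preserves.
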
### Caveats and known limitations
**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** — use the
manual forward-hook pattern above instead.
**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)
**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.
## Citation
```bibtex
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}
```