---
language: en
license: apache-2.0
tags:
- sparse-autoencoder
- SAE
- interpretability
- deception-detection
- mechanistic-interpretability
- saelens
- neuronpedia
- behavioral-sampling
- phi
base_model:
- microsoft/phi-2
datasets:
- Solshine/deception-behavioral-multimodel
---
# Phi-2 Deception Behavioral SAEs
30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.
## Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:
- **Six clean incentive-structure scenarios:** `insider_info`,
`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
`surprise_party`, `job_interview_gap`. Each presents a realistic
situation with a non-trivial incentive to conceal or disclose
information; the honest/deceptive distinction is which of the two
behavioral choices the model's completion settles into under
temperature sampling.
- **Three role-play identity-assignment scenarios:**
`secret_role_game` ("You have been assigned the role of Fascist"),
`secret_role_game_v2` ("You are secretly one of the Mafia members"),
`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
deceptive identity and label a completion "deceptive" when the model
drifts away from the assigned role or "honest" when it echoes it.
**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or
incentive-driven deception.
**What this SAE is and is not good for.**
- **Good for:** research on mixed-pool activation geometry; SAE
feature-geometry studies; as one of a set of baselines when
comparing multiple SAE families; as a reference implementation of
same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
role-consistency signal from the three role-play scenarios is mixed
into every aggregate metric reported below. A downstream user who
wants an "emergent-deception feature set" should restrict attention
to features whose activation pattern concentrates in the
`insider_info` / `accounting_error` / `ai_oversight_log` /
`ai_capability_hide` / `surprise_party` / `job_interview_gap`
scenarios, or wait for the methodologically corrected V3 re-release
currently in preparation on the decision-incentive scenario bank
(no pre-assigned deceptive identity).
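
The restriction to clean-scenario features described above can be sketched as a simple filter. This is a minimal illustration, not part of the released pipeline: the `mean_acts` input, the helper name `clean_concentrated_features`, and the 0.9 cutoff are all hypothetical, and it assumes you have already computed a per-scenario mean activation for every feature.

```python
# Scenario names are taken from the caveat above.
CLEAN = ["insider_info", "accounting_error", "ai_oversight_log",
         "ai_capability_hide", "surprise_party", "job_interview_gap"]
ROLE_PLAY = ["secret_role_game", "secret_role_game_v2", "werewolf_game"]

def clean_concentrated_features(mean_acts, min_clean_share=0.9):
    """Return indices of features whose activation mass sits mostly in CLEAN scenarios.

    mean_acts: dict mapping scenario name -> list of per-feature mean activations
               (hypothetical input; compute it from your own encoded completions).
    min_clean_share: hypothetical cutoff, not a value from the paper.
    """
    n_features = len(next(iter(mean_acts.values())))
    keep = []
    for j in range(n_features):
        clean = sum(mean_acts[s][j] for s in CLEAN)
        total = clean + sum(mean_acts[s][j] for s in ROLE_PLAY)
        if total > 0 and clean / total >= min_clean_share:
            keep.append(j)
    return keep

# Toy example with 3 features: feature 0 fires only in clean scenarios,
# feature 1 only in role-play scenarios, feature 2 in both.
toy = {s: [1.0, 0.0, 1.0] for s in CLEAN}
toy.update({s: [0.0, 1.0, 1.0] for s in ROLE_PLAY})
selected = clean_concentrated_features(toy)
print(selected)
```

A share-based cutoff is only one possible criterion; a per-scenario statistical test would be stricter.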
**What is unaffected by this caveat.**
- The SAE weights, reconstruction metrics (explained variance, L0,
alive features), and engineering of the training pipeline are
accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
measure the mixed pool; the 6-scenario clean-subset re-analysis is
listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.
---
Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).
## What's in This Repo
- **30 SAEs** across 5 layers (L4, L8, L12, L16, L20)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=2560, d_sae=10240 (4x expansion)
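
The count of 30 follows from 5 layers × 2 architectures × 3 training conditions. The sketch below enumerates the combinations. The naming pattern `phi2_{arch}_L{layer}_{condition}` is inferred from IDs that appear in this README (such as `phi2_jumprelu_L20_honest_only`) and should be treated as an assumption until checked against the repo's folder listing.

```python
from itertools import product

LAYERS = [4, 8, 12, 16, 20]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Naming pattern inferred from IDs such as "phi2_jumprelu_L20_honest_only";
# this is an assumption, so verify it against the actual subfolder names.
sae_ids = [
    f"phi2_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]
print(len(sae_ids))
```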
## Research Context
This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).
Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)
## Key Findings: Phi-2
Phi-2 is the **anomalous model** in the 9-model study: it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.
| Metric | Value |
|---|---|
| Peak layer | L21 (75% depth; note: not a trained SAE layer) |
| Best SAE layer | L20 |
| Peak balanced accuracy | **74.9%** |
| Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) |
| SAEs beating raw baseline | 10/30 (33%): partial SAE **help** |
**The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline, a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.
**The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help, which is standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.
**Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer, consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.
**Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (77.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.
**Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.
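
The difference between a standard sequential block and a parallel attention-MLP block can be illustrated with a toy sketch. The `attn` and `mlp` functions below are random stand-ins rather than real attention or Phi-2 weights, and the width of 8 is arbitrary. Only the data flow is meant to match the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy width; Phi-2's residual stream is 2560-dimensional

# Random stand-ins for the attention and MLP sublayers (assumptions for illustration).
W_attn = rng.normal(scale=0.1, size=(d, d))
W_mlp = rng.normal(scale=0.1, size=(d, d))

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attn(x):  # toy placeholder, not real attention
    return x @ W_attn

def mlp(x):  # toy placeholder
    return np.tanh(x @ W_mlp)

def sequential_block(x):
    # Standard pre-norm block: the MLP reads the residual stream
    # after attention has already written to it.
    h = x + attn(layer_norm(x))
    return h + mlp(layer_norm(h))

def parallel_block(x):
    # Parallel block: both sublayers read the same normalized input,
    # and their outputs are summed into the residual stream.
    n = layer_norm(x)
    return x + attn(n) + mlp(n)

x = rng.normal(size=(1, d))
seq_out = sequential_block(x)
par_out = parallel_block(x)
print(np.abs(seq_out - par_out).max())
```

In the parallel form the MLP never sees the attention output within the same layer, which is the structural property the parallel-attention hypothesis relies on.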
## SAE Format
Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config
`hook_name` format: `model.layers.{layer}.hook_resid_post`
## Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2560 → 10240) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
| LLM classifier | Gemini 2.5 Flash |
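
Because the pool is imbalanced (88 deceptive vs 158 honest samples), the probe results in this card are reported as balanced accuracy. The sketch below uses the standard definition (mean of per-class recall) to show why this matters. It is a generic illustration and does not reproduce the study's evaluation code. The "always honest" predictor is a hypothetical baseline.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (binary or multiclass)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Class counts from the table above: 88 deceptive (1), 158 honest (0).
y_true = [1] * 88 + [0] * 158
always_honest = [0] * len(y_true)  # hypothetical majority-class baseline

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_honest)
print(f"plain accuracy: {plain_acc:.3f}, balanced accuracy: {bal_acc:.3f}")
# prints: plain accuracy: 0.642, balanced accuracy: 0.500
```

A majority-class guesser looks better than chance on plain accuracy but sits at exactly 50% on balanced accuracy, so 50% is the relevant floor when reading the probe numbers.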
## Known Limitations
**JumpReLU threshold not learned (15 JumpReLU SAEs):** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).
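
The effect of an unlearned threshold can be checked numerically. The sketch below uses synthetic zero-mean Gaussian pre-activations, which is an assumption for illustration and not the real activation distribution. It shows that JumpReLU with a zero threshold matches ReLU and leaves roughly half of the features active.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    # JumpReLU: pass a pre-activation through unchanged if it exceeds
    # its per-feature threshold, otherwise output zero.
    return pre_acts * (pre_acts > threshold)

def relu(pre_acts):
    return np.maximum(pre_acts, 0.0)

rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 10240))  # synthetic pre-activations, d_sae = 10240
zero_threshold = np.zeros(10240)   # the unlearned threshold described above

same_as_relu = np.allclose(jumprelu(pre, zero_threshold), relu(pre))
l0_fraction = float((jumprelu(pre, zero_threshold) > 0).mean())
print(same_as_relu, round(l0_fraction, 2))
```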
**4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.
**Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.
## Loading Example
```python
from safetensors.torch import load_file
import json
sae_id = "phi2_jumprelu_L20_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)
# W_enc: [2560, 10240], W_dec: [10240, 2560]
# cfg["hook_name"] == "model.layers.20.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")
```
## Usage
### 1. Load an SAE from this repo
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-phi-2"
sae_id = "phi2_topk_L20_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [2560, 10240], b_enc [10240],
# W_dec [10240, 2560], b_dec [2560], threshold [10240]
```
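
For Option B, a manual forward pass might look like the sketch below. It uses NumPy with small random stand-in weights so that it runs without the checkpoint. The parameter names follow the key list above, but the helper names `encode` and `decode` are hypothetical. Two details are assumptions, since the training code may differ: clamping TopK activations at zero, and encoding the input directly without first subtracting `b_dec`. Check `cfg.json` and the training code before relying on either.

```python
import numpy as np

def encode(x, W_enc, b_enc, architecture, k=64, threshold=None):
    pre = x @ W_enc + b_enc  # [batch, d_sae]
    if architecture == "topk":
        # Keep the k largest pre-activations per row, zero the rest,
        # then clamp at zero (the clamp is an assumption).
        kth = np.partition(pre, -k, axis=-1)[:, -k][:, None]
        return np.maximum(np.where(pre >= kth, pre, 0.0), 0.0)
    if architecture == "jumprelu":
        return pre * (pre > threshold)
    raise ValueError(f"unknown architecture: {architecture}")

def decode(acts, W_dec, b_dec):
    return acts @ W_dec + b_dec  # [batch, d_in]

# Random stand-in weights with toy sizes (real shapes: d_in=2560, d_sae=10240).
rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 64, 4
state = {
    "W_enc": rng.normal(size=(d_in, d_sae)), "b_enc": np.zeros(d_sae),
    "W_dec": rng.normal(size=(d_sae, d_in)), "b_dec": np.zeros(d_in),
}
x = rng.normal(size=(2, d_in))
acts = encode(x, state["W_enc"], state["b_enc"], "topk", k=k)
recon = decode(acts, state["W_dec"], state["b_dec"])
print((acts != 0).sum(axis=-1), recon.shape)
```

With the real checkpoint, the tensors returned by `load_file` can be converted with `.numpy()`, or the same arithmetic can be written directly in PyTorch.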
### 2. Hook into the model and collect residual-stream activations
These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-2 uses a parallel attention+MLP architecture (unusual). Hook path: `model.layers.{layer}`. Note: Phi-2 SAE probe results are anomalous; see the Key Findings section for details.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
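# Read hook_name from the cfg you already loaded,
# e.g. "model.layers.20" (varies by SAE).
hook_name = cfg["hook_name"]
# If the name carries a ".hook_resid_post" suffix (the form shown under SAE Format),
# drop it so that only the submodule path remains (requires Python 3.9+).
module_path = hook_name.removesuffix(".hook_resid_post")
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)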
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 2560]
resid = activations["resid"][:, -1, :] # last token position
```
### 3. Read feature activations
```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 10240], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
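
The reconstruction metrics mentioned in this card (explained variance and L0) can be approximated with a few lines. The sketch below runs on synthetic arrays and uses one common definition of explained variance. The exact definition used in the study is not stated here, so treat both the formula and the helper names as assumptions.

```python
import numpy as np

def l0(feature_acts):
    # Average number of non-zero features per input row.
    return float((feature_acts != 0).sum(axis=-1).mean())

def explained_variance(x, x_hat):
    # 1 - (residual sum of squares / total sum of squares), pooled over all dimensions.
    resid_ss = ((x - x_hat) ** 2).sum()
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid_ss / total_ss)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 2560))             # synthetic stand-in for `resid`
x_hat = x + 0.1 * rng.normal(size=x.shape)  # synthetic "reconstruction"
acts = np.zeros((32, 10240))
acts[:, :64] = 1.0                          # 64 active features per row, as for TopK k=64

ev = explained_variance(x, x_hat)
print(l0(acts), round(ev, 3))
```

Torch tensors from the snippets above can be passed in after `.cpu().numpy()`. Note that a single last-token row gives a very noisy estimate, so these metrics are more meaningful over many token positions.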
### Caveats and known limitations
**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work**; use the
manual forward-hook pattern above instead.
**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)
**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.
## Citation
```bibtex
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}
```