---

language: en
license: apache-2.0
tags:
  - sparse-autoencoder
  - SAE
  - interpretability
  - deception-detection
  - mechanistic-interpretability
  - saelens
  - neuronpedia
  - behavioral-sampling
  - phi
  - reasoning
base_model:
  - microsoft/Phi-4-mini-reasoning
datasets:
  - Solshine/deception-behavioral-multimodel
---


# Phi-4-mini-reasoning Deception Behavioral SAEs

42 sparse autoencoders (SAEs) trained on residual-stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning), a 3.8B-parameter, reasoning-fine-tuned Phi-architecture model. The training activations capture behavioral deception signals obtained via same-prompt temperature sampling.


## Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios:** `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios:**
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or
incentive-driven deception.

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios, or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).
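
One way to apply the restriction described above is to score each feature by how much of its activation mass falls in the six incentive-structure scenarios. The sketch below is illustrative only: the function names, the input layout (a mapping from scenario name to a feature's mean activation), and the 0.9 cutoff are assumptions, not part of the released pipeline. Only the scenario names come from this README.

```python
# Scenario names are taken from the list above; everything else is hypothetical.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}
ROLE_PLAY_SCENARIOS = {"secret_role_game", "secret_role_game_v2", "werewolf_game"}


def clean_scenario_fraction(mean_act_by_scenario):
    """Fraction of a feature's total mean activation that comes from clean scenarios."""
    total = sum(mean_act_by_scenario.values())
    if total == 0:
        return 0.0  # dead feature: no activation in any scenario
    clean = sum(v for name, v in mean_act_by_scenario.items() if name in CLEAN_SCENARIOS)
    return clean / total


def select_clean_features(per_feature_stats, min_fraction=0.9):
    """Keep feature ids whose activation concentrates in the clean scenarios."""
    return [
        feature_id
        for feature_id, stats in per_feature_stats.items()
        if clean_scenario_fraction(stats) >= min_fraction
    ]
```

A stricter or looser cutoff changes how many features survive, so any threshold should be reported alongside the results it produces.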

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.

---

Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

## What's in This Repo

- **42 SAEs** across 7 layers (L2, L6, L10, L14, L18, L22, L26)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=3072, d_sae=12288 (4x expansion)
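
The 42 SAEs are the full cross product of the layers, architectures, and training conditions listed above. The snippet below enumerates candidate SAE ids; the naming pattern `phi4_mini_{arch}_L{layer}_{condition}` is an assumption inferred from the two ids quoted elsewhere in this README, so check it against the actual folder names before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, inferred from ids such as
# "phi4_mini_jumprelu_L6_honest_only"; not confirmed for every folder.
sae_ids = [
    f"phi4_mini_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]

print(len(sae_ids))  # 7 layers x 2 architectures x 3 conditions
```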

## Research Context

This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is then classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.
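
The same-prompt sampling idea can be sketched in a few lines. This is a toy illustration, not the study's code: `sample_and_bucket`, `toy_generate`, `toy_classify`, and the example prompt are hypothetical stand-ins for the real model call and the Gemini 2.5 Flash classifier.

```python
import random
from collections import defaultdict


def sample_and_bucket(prompt, generate, classify, n_samples=8, temperature=1.0):
    """Sample several completions of ONE prompt and group them by behavior label."""
    buckets = defaultdict(list)
    for _ in range(n_samples):
        completion = generate(prompt, temperature)  # the prompt never changes
        label = classify(prompt, completion)        # e.g. "honest" or "deceptive"
        buckets[label].append(completion)
    return dict(buckets)


# Toy stand-ins so the sketch runs without a model or an API key.
def toy_generate(prompt, temperature):
    return random.choice(["I disclosed the error.", "I said nothing about it."])


def toy_classify(prompt, completion):
    return "honest" if "disclosed" in completion else "deceptive"


random.seed(0)
buckets = sample_and_bucket("(hypothetical scenario prompt)", toy_generate, toy_classify)
```

Because every completion shares the same prompt, differences between the two buckets come from the sampled behavior rather than from prompt wording.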

Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Key Findings: Phi-4-mini-reasoning

Phi-4-mini-reasoning is the **largest model** in the 9-model study and the only reasoning-fine-tuned model included.

| Metric | Value |
|---|---|
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | **80.8%** |
| Peak AUROC | **0.860** |
| Best SAE probe accuracy | **81.0%** (`phi4_mini_jumprelu_L6_honest_only`) |
| SAEs beating raw baseline | 1/42 (2%); SAEs **hurt** detection |
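
For readers unfamiliar with the metric, the sketch below computes balanced accuracy under the standard definition (the mean of per-class recall). It is assumed, not confirmed here, that the study's probe evaluation uses this same definition; the function is a plain-Python illustration rather than the study's code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

Under this definition a probe that always predicts the majority class scores 0.5 on a two-class problem, which is the natural chance baseline for the figures in the table.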

**Most striking finding, a broad plateau across all 32 layers:** Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across all 32 layers. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.

**Phi architecture anomaly does not persist at 3.8B:** The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs marginally beat raw, 2%), confirming that the parallel-attention effect fades or is overridden as model scale increases.

**Reasoning fine-tuning context:** Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.

**SAE decomposition hurts:** Only 1 of 42 SAEs beats the raw baseline, and only by +0.2pp. This model joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, confirming the 1.3B–1.7B transition as the boundary between the SAE-helps and SAE-hurts regimes.

**Architecture note:** Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups.

## SAE Format

Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config

`hook_name` format: `model.layers.{layer}.hook_resid_post`
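
The Usage section also shows the bare form `model.layers.6`, so code that consumes `hook_name` may want to accept both spellings. The helper below is a hypothetical sketch (the name `parse_hook_name` is not part of this repo) that splits a hook name into the HuggingFace submodule path and the layer index. It relies on the fact that `hook_resid_post` is a naming suffix, not an attribute of the HuggingFace decoder layer.

```python
import re


def parse_hook_name(hook_name):
    """Split a hook name into (HF submodule path, layer index).

    Accepts "model.layers.N" with or without a ".hook_resid_post" suffix.
    """
    match = re.fullmatch(r"(model\.layers\.(\d+))(?:\.hook_resid_post)?", hook_name)
    if match is None:
        raise ValueError(f"unrecognized hook_name: {hook_name!r}")
    return match.group(1), int(match.group(2))
```

For example, both `parse_hook_name("model.layers.6.hook_resid_post")` and `parse_hook_name("model.layers.6")` yield the path `model.layers.6` and the layer index 6.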

## Training Details

| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) |
| LLM classifier | Gemini 2.5 Flash |
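
As a quick sanity check on the dimensions, the snippet below derives `d_sae` from the expansion factor and counts parameters per SAE. The tensor shapes follow the state-dict keys listed in the Usage section; whether every checkpoint actually stores a `threshold` tensor, and the float32 storage assumption, are not confirmed here.

```python
from math import prod

d_in, expansion = 3072, 4
d_sae = d_in * expansion

# Shapes as listed in the Usage section of this README.
shapes = {
    "W_enc": (d_in, d_sae),
    "b_enc": (d_sae,),
    "W_dec": (d_sae, d_in),
    "b_dec": (d_in,),
    "threshold": (d_sae,),  # assumed present in every checkpoint
}

n_params = sum(prod(shape) for shape in shapes.values())
size_mb = n_params * 4 / 1e6  # assumes float32 (4 bytes per parameter)

print(f"d_sae={d_sae}, parameters={n_params:,}, approx. {size_mb:.0f} MB")
```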

## Known Limitations

**JumpReLU threshold not learned:** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
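
The effect of a zero threshold is easy to see in a plain-Python sketch of the activation functions. These are textbook-style definitions written for illustration, operating on lists rather than tensors; they are not the training code, and the TopK variant assumes the kept values are clamped at zero.

```python
def jumprelu(pre_acts, thresholds):
    """JumpReLU: pass a pre-activation through only if it exceeds its threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre_acts, thresholds)]


def relu(pre_acts):
    return [max(z, 0.0) for z in pre_acts]


def topk(pre_acts, k):
    """TopK: keep the k largest pre-activations (clamped at zero), zero the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(z, 0.0) if i in keep else 0.0 for i, z in enumerate(pre_acts)]


pre = [-1.5, 0.0, 0.3, 2.0, -0.2, 1.1]
zero_threshold = jumprelu(pre, [0.0] * len(pre))   # identical to relu(pre)
learned_like = jumprelu(pre, [0.5] * len(pre))     # 0.5 is a made-up threshold
```

With a zero threshold every positive pre-activation fires, which is why L0 sits far above the TopK value of 64.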



**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).

**4-bit quantization:** Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.

**Small dataset:** n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.

## Loading Example

```python
from safetensors.torch import load_file
import json

# Paths are relative to a local copy of this repo
sae_id = "phi4_mini_jumprelu_L6_honest_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
```


## Usage

### 1. Load an SAE from this repo

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id  = "phi4_mini_topk_L6_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
#       W_dec [12288, 3072], b_dec [3072], threshold [12288]
```

### 2. Hook into the model and collect residual-stream activations

These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-4-mini uses a LLaMA-style module layout, so the hook path is `model.layers.{layer}`.

```python
import functools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "model.layers.6"  (example; varies by SAE)
hook_name = cfg["hook_name"]   # e.g. "model.layers.6"

# If the config carries a ".hook_resid_post" suffix, drop it: only the
# "model.layers.{layer}" part is a real submodule path (needs Python 3.9+).
module_path = hook_name.removesuffix(".hook_resid_post")

# Navigate the submodule path and register a forward hook
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}

def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :]  # last token position
```

### 3. Read feature activations

```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```

### Caveats and known limitations

**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** β€” use the
manual forward-hook pattern above instead.

**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.



## Citation



```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
```