---

language: en
license: apache-2.0
tags:
  - sparse-autoencoder
  - SAE
  - interpretability
  - deception-detection
  - mechanistic-interpretability
  - saelens
  - neuronpedia
  - behavioral-sampling
  - phi
base_model:
  - microsoft/phi-2
datasets:
  - Solshine/deception-behavioral-multimodel
---


# Phi-2 Deception Behavioral SAEs

30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.


## Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios**: `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios**:
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play, which is a
well-defined phenomenon but not the same as emergent or incentive-
driven deception.

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios, or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).
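
One way to act on the restriction above is to score each feature by how much of its activation mass falls in the six incentive-structure scenarios. The sketch below assumes you have already computed a non-negative mean activation per feature per scenario; the `clean_fraction` helper, the toy numbers, and the 0.8 cutoff are illustrative assumptions, not part of the released pipeline.

```python
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_fraction(mean_act_by_scenario):
    """Fraction of a feature's summed mean activation that falls in clean scenarios."""
    total = sum(mean_act_by_scenario.values())
    if total == 0:
        return 0.0  # feature never fired on any scenario
    clean = sum(v for k, v in mean_act_by_scenario.items() if k in CLEAN_SCENARIOS)
    return clean / total

# Toy per-feature statistics (hypothetical values, for illustration only)
feature_stats = {
    101: {"insider_info": 0.9, "accounting_error": 0.6, "werewolf_game": 0.1},
    202: {"secret_role_game": 1.2, "werewolf_game": 0.8, "surprise_party": 0.2},
}

CUTOFF = 0.8  # assumed threshold; tune for your own analysis
clean_features = [f for f, s in feature_stats.items() if clean_fraction(s) >= CUTOFF]
print(clean_features)
```

A stricter variant could also require a minimum absolute activation in the clean scenarios, so that rarely firing features do not pass on ratio alone.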

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.

---

Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

## What's in This Repo

- **30 SAEs** across 5 layers (L4, L8, L12, L16, L20)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=2560, d_sae=10240 (4x expansion)
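
The 30 SAEs are the full grid of 5 layers × 2 architectures × 3 training conditions. As a rough sketch, the IDs can be enumerated as below; the `phi2_{arch}_L{layer}_{condition}` pattern is inferred from the two IDs that appear in this README (`phi2_jumprelu_L20_honest_only`, `phi2_topk_L20_honest_only`), so check it against the actual repo listing before relying on it.

```python
from itertools import product

LAYERS = [4, 8, 12, 16, 20]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, inferred from the example IDs in this README
sae_ids = [
    f"phi2_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]

print(len(sae_ids))
```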

## Research Context

This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).

Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Key Findings: Phi-2

Phi-2 is the **anomalous model** in the 9-model study: it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.

| Metric | Value |
|---|---|
| Peak layer | L21 (75% depth; note: not a trained SAE layer) |
| Best SAE layer | L20 |
| Peak balanced accuracy | **74.9%** |
| Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) |
| SAEs beating raw baseline | 10/30 (33%), partial SAE **help** |
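
Balanced accuracy matters here because the classes are imbalanced. The sketch below uses the mixed-condition counts from Training Details (88 deceptive, 158 honest) to show how a trivial always-"honest" probe scores on plain versus balanced accuracy. Treating "deceptive" as the positive class, and assuming the probes were evaluated on data with these proportions, are both assumptions; this README does not specify the evaluation split.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of recall on the positive class and recall on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return 0.5 * (recall_pos + recall_neg)

n_deceptive, n_honest = 88, 158  # mixed-condition counts from Training Details

# A probe that always predicts "honest" gets every honest example right
# and every deceptive example wrong.
plain_acc = n_honest / (n_deceptive + n_honest)
bal_acc = balanced_accuracy(tp=0, fn=n_deceptive, tn=n_honest, fp=0)

print(f"plain accuracy: {plain_acc:.3f}, balanced accuracy: {bal_acc:.3f}")
```

Under these assumptions the trivial probe reaches about 64% plain accuracy but only 50% balanced accuracy, so chance level for the numbers in the table is 50%.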

**The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline, a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.
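
The difference between the two block layouts can be sketched in a few lines. This is a toy illustration only: layer norms are omitted, vectors are plain Python lists, and `toy_attn` / `toy_mlp` are hypothetical stand-ins rather than real attention or MLP sublayers.

```python
def sequential_block(x, attn, mlp):
    """Standard layout: the MLP reads the stream after attention has written to it."""
    h = [xi + ai for xi, ai in zip(x, attn(x))]
    return [hi + mi for hi, mi in zip(h, mlp(h))]

def parallel_block(x, attn, mlp):
    """Parallel layout: attention and MLP read the same input; outputs are summed."""
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(x), mlp(x))]

# Hypothetical stand-ins for the two sublayers
def toy_attn(v):
    return [0.5 * vi for vi in v]

def toy_mlp(v):
    return [max(0.0, vi) for vi in v]

x = [1.0, -2.0]
seq_out = sequential_block(x, toy_attn, toy_mlp)
par_out = parallel_block(x, toy_attn, toy_mlp)
print(seq_out, par_out)
```

In the sequential layout the MLP's contribution depends on what attention wrote in the same block; in the parallel layout the two contributions are independent functions of the same input.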

**The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help, which is standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.

**Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer β€” consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.

**Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (77.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.

**Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.

## SAE Format

Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors`: encoder/decoder weights
- `cfg.json`: SAELens-compatible config

`hook_name` format: `model.layers.{layer}.hook_resid_post`

## Training Details

| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2560 → 10240) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
| LLM classifier | Gemini 2.5 Flash |

## Known Limitations

**JumpReLU threshold not learned (30 SAEs):** All SAEs have `threshold = 0`, making the JumpReLU activation functionally ReLU. L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
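
To make the limitation concrete, here is a minimal sketch of the two activation functions on a toy pre-activation vector. It is a simplified illustration written for this README, not the training code: real implementations operate on tensors, and the `topk` helper here clamps the kept values at zero as one common convention.

```python
def relu(pre):
    return [max(0.0, p) for p in pre]

def jumprelu(pre, threshold):
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, threshold)]

def topk(pre, k):
    """Keep the k largest pre-activations (clamped at zero), zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]

def l0(acts):
    """Number of non-zero features."""
    return sum(a != 0.0 for a in acts)

pre = [0.7, -0.2, 0.1, 1.3, -0.9, 0.4]
zero_threshold = [0.0] * len(pre)

# With threshold = 0 everywhere, JumpReLU is exactly ReLU
assert jumprelu(pre, zero_threshold) == relu(pre)

print(l0(jumprelu(pre, zero_threshold)), l0(topk(pre, 2)))
```

With a zero threshold every positive pre-activation fires, which is consistent with the high L0 reported above, whereas TopK fixes the number of active features at k regardless of threshold.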



**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).

**4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.

**Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.

## Loading Example

```python
import json

from safetensors.torch import load_file

sae_id = "phi2_jumprelu_L20_honest_only"

# Paths are relative to a local clone/download of this repo
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [2560, 10240], W_dec: [10240, 2560]
# cfg["hook_name"] == "model.layers.20.hook_resid_post"
print(f"Training condition: {cfg['training_condition']}")
```


## Usage

### 1. Load an SAE from this repo

```python
import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "Solshine/deception-saes-phi-2"
sae_id  = "phi2_topk_L20_honest_only"   # replace with any SAE ID in this repo

# Download the weights and config for the chosen SAE
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
state = load_file(weights_path)
# Keys: W_enc [2560, 10240], b_enc [10240],
#       W_dec [10240, 2560], b_dec [2560], threshold [10240]
```

### 2. Hook into the model and collect residual-stream activations

These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-2 uses an unusual parallel attention+MLP architecture. Hook path: `model.layers.{layer}`. Note: Phi-2 SAE probe results are anomalous; see the Key Findings section above for details.

```python
import functools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "model.layers.20"  (example; varies by SAE)
hook_name = cfg["hook_name"]   # e.g. "model.layers.20"

# Navigate the submodule path and register a forward hook
submodule = functools.reduce(getattr, hook_name.split("."), model)

activations = {}

def hook_fn(module, args, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()  # always remove the hook once activations are captured

# activations["resid"]: [batch, seq_len, 2560]
resid = activations["resid"][:, -1, :]  # last token position
```

### 3. Read feature activations

```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 10240], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (for a sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```
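
Beyond the L2 error, the reconstruction metrics named earlier in this README (L0 and explained variance) can be computed over a batch of examples. The sketch below uses plain Python lists so it runs without a GPU stack; with tensors you could call `.tolist()` first. The explained-variance formula shown is one common definition and is an assumption here, since this README does not state which definition the reported numbers use.

```python
def l0(feature_acts):
    """Average number of non-zero features per example."""
    return sum(sum(a != 0.0 for a in row) for row in feature_acts) / len(feature_acts)

def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the per-dimension mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    ss_res = sum((x[i][j] - x_hat[i][j]) ** 2 for i in range(n) for j in range(d))
    ss_tot = sum((x[i][j] - mean[j]) ** 2 for i in range(n) for j in range(d))
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical): 2 examples, 2 input dims, 3 SAE features
x     = [[1.0, 2.0], [3.0, 6.0]]
x_hat = [[1.0, 2.0], [3.0, 5.0]]
acts  = [[0.0, 0.5, 0.0], [1.2, 0.3, 0.0]]

ev = explained_variance(x, x_hat)
mean_l0 = l0(acts)
print(ev, mean_l0)
```

Explained variance needs more than one example to be meaningful, because the total sum of squares is taken around the batch mean.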

### Caveats and known limitations

**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** β€” use the
manual forward-hook pattern above instead.
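
If you need to bridge the two naming schemes, the layer index can be pulled out of the HuggingFace-style path. The helpers below are hypothetical and not part of SAELens or this repo; they accept both spellings of `hook_name` that appear in this README, and the mapping to a TransformerLens name assumes the output of HuggingFace layer N corresponds to `resid_post` at block N.

```python
import re

def layer_from_hook_name(hook_name):
    """Extract the layer index from a 'model.layers.{N}' style path."""
    m = re.match(r"model\.layers\.(\d+)(?:\.|$)", hook_name)
    if m is None:
        raise ValueError(f"unrecognized hook name: {hook_name!r}")
    return int(m.group(1))

def to_transformer_lens_name(hook_name):
    """Assumed mapping: HF layer N output -> TransformerLens blocks.N.hook_resid_post."""
    return f"blocks.{layer_from_hook_name(hook_name)}.hook_resid_post"

print(to_transformer_lens_name("model.layers.20"))
```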

**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.



## Citation



```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
```