---

language: en
license: apache-2.0
tags:
  - sparse-autoencoder
  - SAE
  - interpretability
  - deception-detection
  - mechanistic-interpretability
  - saelens
  - neuronpedia
  - behavioral-sampling
  - phi
  - reasoning
base_model:
  - microsoft/Phi-4-mini-reasoning
datasets:
  - Solshine/deception-behavioral-multimodel
---


# Phi-4-mini-reasoning Deception Behavioral SAEs

42 Sparse Autoencoders trained on residual stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling.


## Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate the SAEs
in this repo come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** — `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios** —
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAEs' labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play — which is a
well-defined phenomenon but not the same as emergent or incentive-
driven deception.

**What these SAEs are and are not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios — or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).
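
One way to act on the restriction suggested above is to score each feature by how much of its total activation falls in the six incentive-structure scenarios. The sketch below assumes you have already computed a `[n_samples, d_sae]` activation matrix and a scenario name per sample; the helper name `clean_concentration`, the toy values, and the 0.9 cutoff are illustrative and not part of the released pipeline.

```python
import numpy as np

CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_concentration(feature_acts, scenarios):
    """Fraction of each feature's total activation that comes from clean scenarios.

    feature_acts: [n_samples, d_sae] non-negative SAE activations.
    scenarios:    length-n_samples sequence of scenario names.
    Returns a [d_sae] array; features that never fire get NaN.
    """
    acts = np.asarray(feature_acts, dtype=np.float64)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total_mass > 0, clean_mass / total_mass, np.nan)

# Toy example with 3 features and 4 samples (hypothetical values).
toy_acts = np.array([
    [1.0, 0.0, 0.0],   # insider_info
    [2.0, 0.0, 0.0],   # surprise_party
    [0.0, 3.0, 0.0],   # werewolf_game
    [0.2, 1.0, 0.0],   # secret_role_game
])
toy_scenarios = ["insider_info", "surprise_party", "werewolf_game", "secret_role_game"]

scores = clean_concentration(toy_acts, toy_scenarios)
keep = np.flatnonzero(scores >= 0.9)  # illustrative cutoff, not a recommended value
```

In the toy data only the first feature passes the cutoff: almost all of its activation mass comes from clean scenarios, the second fires only on role-play samples, and the third never fires.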

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.

---

Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

## What's in This Repo

- **42 SAEs** across 7 layers (L2, L6, L10, L14, L18, L22, L26)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=3072, d_sae=12288 (4x expansion)
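
The 42 SAEs are the full cross-product of layer, architecture, and training condition. The sketch below enumerates the subfolder names under the assumption that every SAE follows the `phi4_mini_{arch}_L{layer}_{condition}` pattern seen in the two IDs quoted in this README; check the repo file listing before relying on it.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, inferred from IDs such as
# "phi4_mini_jumprelu_L6_honest_only"; not confirmed for every subfolder.
sae_ids = [
    f"phi4_mini_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]

print(len(sae_ids))  # 7 layers x 2 architectures x 3 conditions
```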

## Research Context

This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is then classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.
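
The sampling-and-labeling loop described above can be sketched as follows. The `generate` and `classify` callables are hypothetical stand-ins for the model's temperature-sampled generation and for the Gemini 2.5 Flash judge; the function name, sample count, and temperature value are illustrative and not taken from the training code.

```python
from collections import defaultdict

def sample_and_split(prompt, generate, classify, n_samples=8, temperature=1.0):
    """Sample completions of ONE prompt and group them by behavioral label.

    generate(prompt, temperature) -> str                      (hypothetical model wrapper)
    classify(prompt, completion) -> "honest" | "deceptive"    (hypothetical judge)
    """
    by_label = defaultdict(list)
    for _ in range(n_samples):
        completion = generate(prompt, temperature)
        by_label[classify(prompt, completion)].append(completion)

    # Assumption: the three training conditions are views over one labeled pool,
    # which matches the reported counts (252 = 123 deceptive + 129 honest).
    return {
        "mixed": by_label["honest"] + by_label["deceptive"],
        "honest_only": list(by_label["honest"]),
        "deceptive_only": list(by_label["deceptive"]),
    }

# Stub demo: a deterministic fake "model" that alternates between two answers.
_answers = iter(["I disclosed the error.", "I hid the error."] * 4)
demo = sample_and_split(
    "<scenario prompt>",
    generate=lambda prompt, temperature: next(_answers),
    classify=lambda prompt, completion: "deceptive" if "hid" in completion else "honest",
)
```

Because the prompt is held fixed, any difference between the two groups comes from the sampled completion rather than from prompt wording.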

Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Key Findings — Phi-4-mini-reasoning

Phi-4-mini-reasoning is the **largest model** in the 9-model study and the only reasoning-fine-tuned model included.

| Metric | Value |
|---|---|
| Peak layer | L20 (64% depth) |
| Peak balanced accuracy | **80.8%** |
| Peak AUROC | **0.860** |
| Best SAE probe accuracy | **81.0%** (`phi4_mini_jumprelu_L6_honest_only`) |
| SAEs beating raw baseline | 1/42 (2%) — SAEs **hurt** detection |
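
For reference, balanced accuracy is the mean of per-class recall, which keeps the metric meaningful when the honest and deceptive classes differ in size. A minimal sketch with hypothetical labels and predictions (the function is a plain-Python illustration, not the evaluation code used for the table):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 1 = deceptive, 0 = honest.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # mean of recall 3/4 and recall 5/6
```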

**Most striking finding — broad plateau across all 32 layers:** Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.

**Phi architecture anomaly does not persist at 3.8B:** The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs, or 2%, marginally beats raw), suggesting that the parallel-attention effect fades or is overridden as model scale increases.

**Reasoning fine-tuning context:** Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.

**SAE decomposition hurts:** Only 1/42 SAEs beats the raw baseline, and only by +0.2pp. Phi-4-mini-reasoning joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, consistent with the 1.3B–1.7B transition being the boundary between SAE-helps and SAE-hurts regimes.

**Architecture note:** Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups.

## SAE Format

Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors` — encoder/decoder weights
- `cfg.json` — SAELens-compatible config

`hook_name` format: `model.layers.{layer}.hook_resid_post`

## Training Details

| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (3072 → 12288) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) |
| LLM classifier | Gemini 2.5 Flash |
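
As a rough size check, the dimensions above imply the per-SAE parameter count computed below. The tensor names follow the key list given in the Usage section; the float32 byte estimate is an assumption about the stored dtype, and the `threshold` vector is counted even though it may only matter for JumpReLU.

```python
d_in, d_sae = 3072, 12288

params = {
    "W_enc": d_in * d_sae,
    "b_enc": d_sae,
    "W_dec": d_sae * d_in,
    "b_dec": d_in,
    "threshold": d_sae,
}
total = sum(params.values())
approx_mb = total * 4 / 1e6  # assumes float32 storage (4 bytes per parameter)

print(f"{total:,} parameters, ~{approx_mb:.0f} MB per SAE")
```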

## Known Limitations

**JumpReLU threshold not learned (21 of 42 SAEs):** All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
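
To see why a zero threshold collapses JumpReLU into ReLU and yields dense codes, consider the following numpy sketch. The random, zero-centered pre-activations and the positive threshold of 1.5 are assumptions chosen for illustration; real encoder outputs need not be symmetric around zero, and a learned threshold would be per-feature.

```python
import numpy as np

def jumprelu(pre, threshold):
    """Pass a pre-activation through unchanged where it exceeds the threshold, else zero."""
    return np.where(pre > threshold, pre, 0.0)

def l0_fraction(acts):
    """Mean fraction of non-zero features per row."""
    return (acts != 0).sum(axis=-1).mean() / acts.shape[-1]

rng = np.random.default_rng(0)
pre = rng.standard_normal((16, 12288))  # stand-in for encoder pre-activations

relu_out = np.maximum(pre, 0.0)
jump_zero = jumprelu(pre, threshold=0.0)   # what these checkpoints compute
jump_pos = jumprelu(pre, threshold=1.5)    # hypothetical learned threshold

frac_zero = l0_fraction(jump_zero)  # about half of d_sae for symmetric inputs
frac_pos = l0_fraction(jump_pos)    # much sparser once the threshold is positive
```

With `threshold = 0` the output is identical to ReLU, so sparsity comes only from the sign of the pre-activation. TopK avoids the issue because it keeps a fixed number of features regardless of any threshold.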



**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).

**4-bit quantization:** Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.

**Small dataset:** n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.

## Loading Example

```python
import json

from safetensors.torch import load_file

sae_id = "phi4_mini_jumprelu_L6_honest_only"

# Paths are relative to a local copy of this repo
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:
    cfg = json.load(f)

# W_enc: [3072, 12288], W_dec: [12288, 3072]
# cfg["hook_name"] == "model.layers.6.hook_resid_post"
print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
```


## Usage

### 1. Load an SAE from this repo

```python
import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
sae_id  = "phi4_mini_topk_L6_honest_only"   # replace with any tag in this repo

weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

with open(cfg_path) as f:
    cfg = json.load(f)

# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))

# Option B: load manually (no SAELens dependency)
state = load_file(weights_path)
# Keys: W_enc [3072, 12288], b_enc [12288],
#       W_dec [12288, 3072], b_dec [3072], threshold [12288]
```
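
If you take Option B, you need your own encode and decode functions. The sketch below shows one common TopK formulation using the key names listed above, run on small random numpy weights so that it is self-contained (real tensors from `load_file` would first need converting, for example with `.float().numpy()`). The `subtract_b_dec` flag is a hypothetical parameter: whether the decoder bias is subtracted from the input before encoding depends on the SAE configuration, so check `cfg.json` rather than trusting the default here.

```python
import numpy as np

def encode_topk(x, state, k=64, subtract_b_dec=True):
    """x: [batch, d_in] -> features [batch, d_sae] with at most k non-zeros per row."""
    if subtract_b_dec:  # assumption: verify against your cfg.json
        x = x - state["b_dec"]
    pre = np.maximum(x @ state["W_enc"] + state["b_enc"], 0.0)
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]  # indices of the k largest per row
    feats = np.zeros_like(pre)
    np.put_along_axis(feats, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return feats

def decode(feats, state):
    """feats: [batch, d_sae] -> reconstruction [batch, d_in]."""
    return feats @ state["W_dec"] + state["b_dec"]

# Tiny random stand-in weights (real shapes: d_in=3072, d_sae=12288).
rng = np.random.default_rng(0)
d_in, d_sae = 8, 32
toy_state = {
    "W_enc": rng.standard_normal((d_in, d_sae)),
    "b_enc": np.zeros(d_sae),
    "W_dec": rng.standard_normal((d_sae, d_in)),
    "b_dec": np.zeros(d_in),
}
x = rng.standard_normal((2, d_in))
feats = encode_topk(x, toy_state, k=4)
recon = decode(feats, toy_state)
```

The random weights will not reconstruct `x` well; the point is only the shape of the computation and the fixed sparsity budget.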

### 2. Hook into the model and collect residual-stream activations

These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-4-mini uses a LLaMA-style architecture, so the hook path is `model.layers.{layer}`.

```python
import functools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model     = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

# Read hook_name from the cfg you already loaded:
#   cfg["hook_name"] == "model.layers.6"  (example; varies by SAE)
hook_name = cfg["hook_name"]
# If the stored name ends in ".hook_resid_post", drop that suffix so the
# remaining path resolves to a real submodule (e.g. "model.layers.6").
module_path = hook_name.removesuffix(".hook_resid_post")  # Python 3.9+

# Navigate the submodule path and register a forward hook
submodule = functools.reduce(getattr, module_path.split("."), model)

activations = {}

def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()

handle = submodule.register_forward_hook(hook_fn)

inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# activations["resid"]: [batch, seq_len, 3072]
resid = activations["resid"][:, -1, :]  # last token position
```

### 3. Read feature activations

```python
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 12288], sparse

# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features    = feature_acts[0].topk(10)

print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:",  top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())

# Reconstruct (sanity check: should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
```

### Caveats and known limitations

**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** — use the
manual forward-hook pattern above instead.
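
If you do want to line these SAEs up with TransformerLens-based tooling, the layer index can be extracted from the HuggingFace-style path and rewritten. The helper below is a hypothetical convenience, not part of SAELens; it assumes the `model.layers.{layer}` prefix described above (optionally followed by `.hook_resid_post`) and the `blocks.{layer}.hook_resid_post` naming shown in the example. Whether TransformerLens supports this particular model is a separate question that the helper does not address.

```python
import re

def hf_to_tl_hook_name(hook_name: str) -> str:
    """Map 'model.layers.6' (optionally with a trailing '.hook_resid_post')
    to the TransformerLens-style name 'blocks.6.hook_resid_post'."""
    match = re.fullmatch(r"model\.layers\.(\d+)(?:\.hook_resid_post)?", hook_name)
    if match is None:
        raise ValueError(f"Unrecognized hook name: {hook_name!r}")
    return f"blocks.{int(match.group(1))}.hook_resid_post"

print(hf_to_tl_hook_name("model.layers.6"))
```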

**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

**These SAEs detect deceptive *behavior*, not deceptive *prompts*.**
They were trained on response-level activations where the same prompt produced both
deceptive and honest outputs. Feature activation differences reflect behavioral
divergence, not prompt content. See the paper for experimental design details.



## Citation



```bibtex
@article{thesecretagenda2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb},
  journal={arXiv:2509.20393},
  year={2025}
}
```