File size: 13,402 Bytes

80cca10

---

language: en
license: apache-2.0
tags:
  - sparse-autoencoder
  - SAE
  - interpretability
  - deception-detection
  - mechanistic-interpretability
  - saelens
  - neuronpedia
  - behavioral-sampling
  - phi
base_model:
  - microsoft/phi-2
datasets:
  - Solshine/deception-behavioral-multimodel
---


# Phi-2 Deception Behavioral SAEs

30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.


## Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE
come from a **mixed scenario pool** of nine prompts:

- **Six clean incentive-structure scenarios** — `insider_info`,
  `accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
  `surprise_party`, `job_interview_gap`. Each presents a realistic
  situation with a non-trivial incentive to conceal or disclose
  information; the honest/deceptive distinction is which of the two
  behavioral choices the model's completion settles into under
  temperature sampling.
- **Three role-play identity-assignment scenarios** —
  `secret_role_game` ("You have been assigned the role of Fascist"),
  `secret_role_game_v2` ("You are secretly one of the Mafia members"),
  `werewolf_game` ("You are a Werewolf"). These pre-assign the model a
  deceptive identity and label a completion "deceptive" when the model
  drifts away from the assigned role or "honest" when it echoes it.

**What this mixed pool means for the SAE's labels.** Within the six
incentive-structure scenarios, the honest/deceptive distinction is a
measurement of behavioral choice under an ambiguous incentive. Within
the three role-play scenarios, the distinction is a measurement of
role-consistency under identity-assigned role-play — which is a
well-defined phenomenon but not the same as emergent or incentive-
driven deception.

**What this SAE is and is not good for.**

- **Good for:** research on mixed-pool activation geometry; SAE
  feature-geometry studies; as one of a set of baselines when
  comparing multiple SAE families; as a reference implementation of
  same-prompt temperature-sampled behavioral SAE training at scale.
- **Not recommended as a standalone deception detector.** The
  role-consistency signal from the three role-play scenarios is mixed
  into every aggregate metric reported below. A downstream user who
  wants an "emergent-deception feature set" should restrict attention
  to features whose activation pattern concentrates in the
  `insider_info` / `accounting_error` / `ai_oversight_log` /
  `ai_capability_hide` / `surprise_party` / `job_interview_gap`
  scenarios — or wait for the methodologically corrected V3 re-release
  currently in preparation on the decision-incentive scenario bank
  (no pre-assigned deceptive identity).

**What is unaffected by this caveat.**

- The SAE weights, reconstruction metrics (explained variance, L0,
  alive features), and engineering of the training pipeline are
  accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper
  measure the mixed pool; the 6-scenario clean-subset re-analysis is
  listed as a planned appendix for the next manuscript revision.

A companion methodology-first Gemma 4 SAE suite is in preparation using
pretraining-distribution data + a decision-incentive behavior split;
this README will be updated with a link when that release is public.

---

Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

## What's in This Repo

- **30 SAEs** across 5 layers (L4, L8, L12, L16, L20)
- **2 architectures:** TopK (k=64), JumpReLU
- **3 training conditions:** `mixed`, `deceptive_only`, `honest_only`
- **Format:** SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- **Dimensions:** d_in=2560, d_sae=10240 (4x expansion)

## Research Context

This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).

Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

## Key Findings — Phi-2

Phi-2 is the **anomalous model** in the 9-model study — it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.

| Metric | Value |
|---|---|
| Peak layer | L21 (75% depth — note: not a trained SAE layer) |
| Best SAE layer | L20 |
| Peak balanced accuracy | **74.9%** |
| Best SAE probe accuracy | **79.4%** (`phi2_jumprelu_L20_honest_only`) |
| SAEs beating raw baseline | 10/30 (33%) — partial SAE **help** |

**The parallel attention anomaly:** At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline — a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.

**The anomaly does not persist at 3.8B:** Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help — standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.

**Best SAE outperforms raw by +4.5pp:** `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer — consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.

**Near-linear layer profile:** Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (77.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.

**Architecture note:** Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.

## SAE Format

Each SAE lives in a subfolder named `{sae_id}/` containing:
- `sae_weights.safetensors` — encoder/decoder weights
- `cfg.json` — SAELens-compatible config

`hook_name` format: `model.layers.{layer}.hook_resid_post`

## Training Details

| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400–600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (2560 → 10240) |
| Model quantization | 4-bit (bitsandbytes) for activation collection |
| Activations | `resid_post` collected during autoregressive generation |
| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
| LLM classifier | Gemini 2.5 Flash |

## Known Limitations

**JumpReLU threshold not learned (30 SAEs):** All SAEs have `threshold = 0` — functionally ReLU. L0 ≈ 50% of d_sae. TopK SAEs are unaffected.



**STE fix (2026-04-11):** The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).

**4-bit quantization:** Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.

**Anomaly not fully explained:** The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.

## Loading Example

```python

from safetensors.torch import load_file

import json



sae_id = "phi2_jumprelu_L20_honest_only"

weights = load_file(f"{sae_id}/sae_weights.safetensors")

cfg = json.load(open(f"{sae_id}/cfg.json"))



# W_enc: [2560, 10240], W_dec: [10240, 2560]

# cfg["hook_name"] == "model.layers.20.hook_resid_post"

print(f"Training condition: {cfg['training_condition']}")

```


## Usage

### 1. Load an SAE from this repo

```python

from huggingface_hub import hf_hub_download

from safetensors.torch import load_file

import json



repo_id = "Solshine/deception-saes-phi-2"

sae_id  = "phi2_topk_L20_honest_only"   # replace with any tag in this repo



weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")

cfg_path     = hf_hub_download(repo_id, f"{sae_id}/cfg.json")



with open(cfg_path) as f:

    cfg = json.load(f)



# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)

from sae_lens import SAE

sae = SAE.from_dict(cfg)

sae.load_state_dict(load_file(weights_path))



# Option B — load manually (no SAELens dependency)

from safetensors.torch import load_file

state = load_file(weights_path)

# Keys: W_enc [2560, 10240], b_enc [10240],

#       W_dec [10240, 2560], b_dec [2560], threshold [10240]

```

### 2. Hook into the model and collect residual-stream activations

These SAEs were trained on the **residual stream after each transformer layer**.
The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
submodule path to hook. Phi-2 uses a parallel attention+MLP architecture (unusual). Hook path: `model.layers.{layer}`. Note: Phi-2 SAE probe results are anomalous — see README body for details.

```python

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer



model     = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")



# Read hook_name from the cfg you already loaded:

#   cfg["hook_name"] == "model.layers.20"  (example — varies by SAE)

hook_name = cfg["hook_name"]   # e.g. "model.layers.20"



# Navigate the submodule path and register a forward hook

import functools

submodule = functools.reduce(getattr, hook_name.split("."), model)



activations = {}

def hook_fn(module, input, output):

    # Most transformer layers return (hidden_states, ...) as a tuple

    h = output[0] if isinstance(output, tuple) else output

    activations["resid"] = h.detach()



handle = submodule.register_forward_hook(hook_fn)



inputs = tokenizer("Your text here", return_tensors="pt")

with torch.no_grad():

    model(**inputs)

handle.remove()



# activations["resid"]: [batch, seq_len, 2560]

resid = activations["resid"][:, -1, :]  # last token position

```

### 3. Read feature activations

```python

with torch.no_grad():

    feature_acts = sae.encode(resid)  # [batch, 10240] — sparse



# Which features fired?

active_features = feature_acts[0].nonzero(as_tuple=True)[0]

top_features    = feature_acts[0].topk(10)



print("Active feature indices:", active_features.tolist())

print("Top-10 feature values:",  top_features.values.tolist())

print("Top-10 feature indices:", top_features.indices.tolist())



# Reconstruct (for sanity check — should be close to resid)

reconstruction = sae.decode(feature_acts)

l2_error = (resid - reconstruction).norm(dim=-1).mean()

```

### Caveats and known limitations

**Hook names are HuggingFace `transformers`-style, not TransformerLens-style.**
The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
`SAE.from_pretrained()` with automatic model running **will not work** — use the
manual forward-hook pattern above instead.

**SAELens version requirements.**
- `topk` architecture: SAELens ≥ 3.0
- `jumprelu` architecture: SAELens ≥ 3.0
- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

**These SAEs detect deceptive *behavior*, not deceptive *prompts**.*

They were trained on response-level activations where the same prompt produced both

deceptive and honest outputs. Feature activation differences reflect behavioral

divergence, not prompt content. See the paper for experimental design details.



## Citation



```bibtex

@article{thesecretagenda2025,

  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},

  author={DeLeeuw, Caleb},

  journal={arXiv:2509.20393},

  year={2025}

}

```