	---
	language: en
	license: apache-2.0
	tags:
	- sparse-autoencoder
	- SAE
	- interpretability
	- deception-detection
	- mechanistic-interpretability
	- saelens
	- neuronpedia
	- behavioral-sampling
	- phi
	base_model:
	- microsoft/phi-2
	datasets:
	- Solshine/deception-behavioral-multimodel
	---

	# Phi-2 Deception Behavioral SAEs

	30 Sparse Autoencoders trained on residual stream activations from [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) (2.7B parameter Phi-architecture base model with parallel attention), capturing behavioral deception signals via same-prompt temperature sampling.


	## Training-data caveat — please read before use

	The "honest" and "deceptive" labels used to train and evaluate these SAEs
	come from a mixed scenario pool of nine prompts:

	- Six clean incentive-structure scenarios — `insider_info`,
	`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
	`surprise_party`, `job_interview_gap`. Each presents a realistic
	situation with a non-trivial incentive to conceal or disclose
	information; the honest/deceptive distinction is which of the two
	behavioral choices the model's completion settles into under
	temperature sampling.
	- Three role-play identity-assignment scenarios —
	`secret_role_game` ("You have been assigned the role of Fascist"),
	`secret_role_game_v2` ("You are secretly one of the Mafia members"),
	`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
	deceptive identity and label a completion "deceptive" when the model
	drifts away from the assigned role or "honest" when it echoes it.

	What this mixed pool means for the SAE labels: within the six
	incentive-structure scenarios, the honest/deceptive distinction is a
	measurement of behavioral choice under an ambiguous incentive. Within
	the three role-play scenarios, the distinction is a measurement of
	role-consistency under identity-assigned role-play — which is a
	well-defined phenomenon but not the same as emergent or incentive-
	driven deception.

	What these SAEs are and are not good for:

	- Good for: research on mixed-pool activation geometry; SAE
	feature-geometry studies; as one of a set of baselines when
	comparing multiple SAE families; as a reference implementation of
	same-prompt temperature-sampled behavioral SAE training at scale.
	- Not recommended as a standalone deception detector. The
	role-consistency signal from the three role-play scenarios is mixed
	into every aggregate metric reported below. A downstream user who
	wants an "emergent-deception feature set" should restrict attention
	to features whose activation pattern concentrates in the
	`insider_info` / `accounting_error` / `ai_oversight_log` /
	`ai_capability_hide` / `surprise_party` / `job_interview_gap`
	scenarios — or wait for the methodologically corrected V3 re-release
	currently in preparation on the decision-incentive scenario bank
	(no pre-assigned deceptive identity).
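
One way to apply the restriction described above is to score each feature by how much of its total activation mass falls on the six incentive-structure scenarios. The sketch below is a minimal illustration and not the authors' procedure: the function name `clean_scenario_fraction`, the per-sample `scenarios` metadata, and the 0.9 cutoff are all assumptions, and `toy_acts` stands in for SAE activations you have collected yourself.

```python
import numpy as np

# Scenario names taken from the caveat above.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_scenario_fraction(feature_acts, scenarios):
    """Fraction of each feature's total activation that falls on clean scenarios.

    feature_acts: [n_samples, d_sae] non-negative SAE activations.
    scenarios:    length-n_samples list of scenario names (hypothetical metadata).
    Returns a [d_sae] array; features that never fire get NaN.
    """
    acts = np.asarray(feature_acts, dtype=float)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = acts[is_clean].sum(axis=0)
    total_mass = acts.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total_mass > 0, clean_mass / total_mass, np.nan)

# Toy example: 3 samples, 3 features.
toy_acts = np.array([
    [1.0, 3.0, 0.0],   # insider_info
    [1.0, 0.0, 0.0],   # werewolf_game
    [2.0, 1.0, 0.0],   # surprise_party
])
toy_scenarios = ["insider_info", "werewolf_game", "surprise_party"]
fraction = clean_scenario_fraction(toy_acts, toy_scenarios)
keep = np.flatnonzero(fraction >= 0.9)  # 0.9 is an arbitrary illustrative cutoff
```

In the toy data only the second feature fires exclusively on clean scenarios, so it is the only one kept. A feature that passes this filter is concentrated in the clean scenarios, which does not by itself show that it tracks deception.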

	What is unaffected by this caveat:

	- The SAE weights, reconstruction metrics (explained variance, L0,
	alive features), and engineering of the training pipeline are
	accurate as reported.
	- The linear-probe balanced-accuracy numbers in the upstream paper
	measure the mixed pool; the 6-scenario clean-subset re-analysis is
	listed as a planned appendix for the next manuscript revision.

	A companion methodology-first Gemma 4 SAE suite is in preparation using
	pretraining-distribution data + a decision-incentive behavior split;
	this README will be updated with a link when that release is public.

	---

	Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

	## What's in This Repo

	- 30 SAEs across 5 layers (L4, L8, L12, L16, L20)
	- 2 architectures: TopK (k=64), JumpReLU
	- 3 training conditions: `mixed`, `deceptive_only`, `honest_only`
	- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
	- Dimensions: d_in=2560, d_sae=10240 (4x expansion)
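
The grid above (2 architectures × 5 layers × 3 conditions) can be enumerated programmatically. This is a sketch under one assumption: that every subfolder follows the naming pattern seen in ids such as `phi2_jumprelu_L20_honest_only`. Check the repo file listing before relying on it.

```python
from itertools import product

ARCHITECTURES = ["topk", "jumprelu"]
LAYERS = [4, 8, 12, 16, 20]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, inferred from ids such as "phi2_jumprelu_L20_honest_only".
sae_ids = [
    f"phi2_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]
print(len(sae_ids))  # 30
```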

	## Research Context

	This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash. Model was run with 4-bit quantization (`--quantize-4bit`, ~1.6 GB GPU footprint).
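
The sampling-and-labeling loop described above can be sketched as follows. This is a simplified illustration, not the study's code: `generate_fn` and `classify_fn` are hypothetical callables standing in for the sampled model call and the LLM judge, the defaults for `n_samples` and `temperature` are arbitrary, and skipping unlabeled completions is an assumption.

```python
from typing import Callable, Dict, List

def sample_and_split(
    prompt: str,
    generate_fn: Callable[[str, float], str],
    classify_fn: Callable[[str, str], str],
    n_samples: int = 20,
    temperature: float = 1.0,
) -> Dict[str, List[str]]:
    """Sample completions of one prompt and split them by behavioral label.

    generate_fn(prompt, temperature) -> completion text (hypothetical wrapper
        around a sampling call to the model).
    classify_fn(prompt, completion) -> "honest" or "deceptive" (hypothetical
        wrapper around the LLM judge).
    """
    splits: Dict[str, List[str]] = {"mixed": [], "deceptive_only": [], "honest_only": []}
    for _ in range(n_samples):
        completion = generate_fn(prompt, temperature)
        label = classify_fn(prompt, completion)
        if label not in ("honest", "deceptive"):
            continue  # assumption: drop completions the judge could not label
        splits["mixed"].append(completion)
        splits[f"{label}_only"].append(completion)
    return splits

# Deterministic stand-ins so the sketch runs without a model or API key.
_canned = iter(["I disclosed it.", "I hid it.", "I disclosed it.", "unclear"])
_labels = {"I disclosed it.": "honest", "I hid it.": "deceptive"}
demo = sample_and_split(
    "scenario prompt",
    generate_fn=lambda prompt, temperature: next(_canned),
    classify_fn=lambda prompt, completion: _labels.get(completion, "ambiguous"),
    n_samples=4,
)
```

The three keys mirror the `mixed`, `deceptive_only`, and `honest_only` training conditions: every labeled completion lands in `mixed` and in exactly one of the other two.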

	Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

	## Key Findings — Phi-2

	Phi-2 is the anomalous model in the 9-model study — it violates the expected pattern for its parameter count in a way that implicates its distinctive parallel attention architecture.

	| Metric | Value |
	|---|---|
	| Peak layer | L21 (75% depth; note: not a trained SAE layer) |
	| Best SAE layer | L20 |
	| Peak balanced accuracy | 74.9% |
	| Best SAE probe accuracy | 79.4% (`phi2_jumprelu_L20_honest_only`) |
	| SAEs beating raw baseline | 10/30 (33%): partial SAE help |
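
For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so a probe cannot score well by always predicting the majority class. The sketch below illustrates this with the class sizes listed under Training Details (158 honest, 88 deceptive). It assumes the probe is evaluated on a pool with that same split, which this card does not state explicitly.

```python
def balanced_accuracy(y_true, y_pred, classes=("honest", "deceptive")):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for c in classes:
        n_c = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / n_c)
    return sum(recalls) / len(recalls)

# Class sizes from the Training Details table: 158 honest, 88 deceptive.
y_true = ["honest"] * 158 + ["deceptive"] * 88
always_honest = ["honest"] * len(y_true)

plain_acc = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_honest)
print(f"plain={plain_acc:.3f} balanced={bal_acc:.3f}")  # plain=0.642 balanced=0.500
```

A trivial majority-class predictor therefore sits at 50% balanced accuracy, which is the chance level the numbers in the table should be read against.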

	The parallel attention anomaly: At 2.7B parameters, the cross-model trend predicts SAEs should hurt (as they do for Qwen3-1.7B and nanochat-d32). Instead, 33% of Phi-2's SAEs beat the raw baseline — a rate comparable to the sub-1.3B models where SAEs help. The leading hypothesis is Phi-2's parallel (simultaneous) attention+MLP architecture. Standard transformers route residual stream information through attention before MLP; Phi-2 runs both in parallel and sums the outputs. This parallel path may produce more concentrated, less distributed deception encoding that SAEs can decompose more effectively.
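
The structural difference between the two block types can be made concrete with a toy example. This is a schematic sketch only: `toy_attn` and `toy_mlp` are hypothetical stand-ins (a linear map and a tanh layer), not real attention or Phi-2's MLP, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
W_attn = rng.normal(scale=0.1, size=(d_model, d_model))  # stand-in for attention
W_mlp = rng.normal(scale=0.1, size=(d_model, d_model))   # stand-in for the MLP

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def toy_attn(x):
    return x @ W_attn

def toy_mlp(x):
    return np.tanh(x @ W_mlp)

def sequential_block(x):
    # Attention writes to the residual stream first; the MLP reads the updated stream.
    h = x + toy_attn(layer_norm(x))
    return h + toy_mlp(layer_norm(h))

def parallel_block(x):
    # Attention and MLP read the same normalized input; both outputs are summed.
    normed = layer_norm(x)
    return x + toy_attn(normed) + toy_mlp(normed)

x = rng.normal(size=(2, d_model))
seq_out, par_out = sequential_block(x), parallel_block(x)
```

In the parallel form the MLP never sees the attention output within the same layer, which is the property the hypothesis above appeals to. The sketch only shows the wiring and says nothing about whether that wiring causes the observed SAE behavior.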

	The anomaly does not persist at 3.8B: Phi-4-mini-reasoning (also Phi architecture, but larger) shows only 1/42 (2%) SAE help — standard large-model behavior. The parallel-attention effect appears to be a Phi-2-specific phenomenon that fades at scale.

	Best SAE outperforms raw by +4.5pp: `phi2_jumprelu_L20_honest_only` achieves 79.4% vs L20 raw of 74.9%. Unlike d20/TinyLlama/Pythia160m where the SAE advantage is at a specific training condition, for Phi-2 the honest_only condition wins at the best layer — consistent with the JumpReLU+honest_only pattern seen across all SAE-helps models in the study.

	Near-linear layer profile: Phi-2 shows a broad peak around L18–L20 (58–67% depth) with a 3.8pp gap between L16 (71.1%) and L18 (74.9%). This is shallower than the nanochat-d32 spike but more peaked than Phi-4-mini's plateau.

	Architecture note: Phi-2 uses Microsoft's Phi architecture with parallel attention-MLP blocks (attention and MLP computed simultaneously from the same residual stream input, outputs summed). 32 transformer layers, 2560-dimensional residual stream. The model was trained primarily on synthetic high-quality data (textbooks, code), which may affect the nature of deception representations.

	## SAE Format

	Each SAE lives in a subfolder named `{sae_id}/` containing:
	- `sae_weights.safetensors` — encoder/decoder weights
	- `cfg.json` — SAELens-compatible config

	`hook_name` format: `model.layers.{layer}.hook_resid_post`
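
If you need the layer index from a `hook_name`, a small parser is enough. This README shows the hook name both with and without the `.hook_resid_post` suffix, so the hypothetical helper below accepts either form. The regular expression is an assumption based on those two examples only.

```python
import re

# Matches "model.layers.<N>" with an optional ".hook_resid_post" suffix.
_HOOK_RE = re.compile(r"^model\.layers\.(\d+)(?:\.hook_resid_post)?$")

def layer_from_hook_name(hook_name: str) -> int:
    """Extract the layer index from a hook name such as 'model.layers.20.hook_resid_post'."""
    match = _HOOK_RE.match(hook_name)
    if match is None:
        raise ValueError(f"Unrecognized hook_name: {hook_name!r}")
    return int(match.group(1))

print(layer_from_hook_name("model.layers.20.hook_resid_post"))  # 20
print(layer_from_hook_name("model.layers.4"))                   # 4
```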

	## Training Details

	| Parameter | Value |
	|---|---|
	| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
	| Training time | ~400–600 seconds per SAE |
	| Epochs | 300 |
	| Batch size | 128 |
	| Expansion factor | 4x (2560 → 10240) |
	| Model quantization | 4-bit (bitsandbytes) for activation collection |
	| Activations | `resid_post` collected during autoregressive generation |
	| Training conditions | `mixed` (n=246), `deceptive_only` (n=88), `honest_only` (n=158) |
	| LLM classifier | Gemini 2.5 Flash |

	## Known Limitations

	JumpReLU threshold not learned (15 SAEs): All JumpReLU SAEs have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected.
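
To see why a zero threshold reduces JumpReLU to ReLU, recall that JumpReLU passes a pre-activation through unchanged when it exceeds the threshold and outputs zero otherwise. The sketch below checks the equivalence numerically on synthetic data. The Gaussian `pre_acts` are an assumption made for illustration and are not the SAEs' real pre-activations.

```python
import numpy as np

def jumprelu(z, threshold):
    """JumpReLU: pass z through where z exceeds the threshold, else output 0."""
    return z * (z > threshold)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(1000, 64))  # synthetic, zero-centered pre-activations

same_as_relu = np.allclose(jumprelu(pre_acts, 0.0), relu(pre_acts))
frac_active = (jumprelu(pre_acts, 0.0) > 0).mean()
```

With zero-centered inputs roughly half of the units are active, which is in line with the L0 ≈ 50% figure above. Real pre-activations need not be symmetric, so treat this as intuition rather than an explanation of the measured value.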

	STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage is confirmed not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama).

	4-bit quantization: Activations were collected from a 4-bit quantized model. The parallel-attention anomaly may be amplified or dampened by quantization effects.

	Anomaly not fully explained: The 33% SAE-helps rate at 2.7B is not yet mechanistically explained. The parallel-attention hypothesis is plausible but not tested by ablation.

	## Loading Example

	```python
	import json

	from safetensors.torch import load_file

	# Paths are relative to a local copy of this repo.
	sae_id = "phi2_jumprelu_L20_honest_only"
	weights = load_file(f"{sae_id}/sae_weights.safetensors")
	with open(f"{sae_id}/cfg.json") as f:  # context manager closes the file
	    cfg = json.load(f)

	# W_enc: [2560, 10240], W_dec: [10240, 2560]
	# cfg["hook_name"] == "model.layers.20.hook_resid_post"
	print(f"Training condition: {cfg['training_condition']}")
	```


	## Usage

	### 1. Load an SAE from this repo

	```python
	import json

	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file

	repo_id = "Solshine/deception-saes-phi-2"
	sae_id = "phi2_topk_L20_honest_only"  # replace with any SAE id in this repo

	weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
	cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

	with open(cfg_path) as f:
	    cfg = json.load(f)

	# Option A: load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
	from sae_lens import SAE
	sae = SAE.from_dict(cfg)
	sae.load_state_dict(load_file(weights_path))

	# Option B: load manually (no SAELens dependency)
	state = load_file(weights_path)
	# Keys: W_enc [2560, 10240], b_enc [10240],
	#       W_dec [10240, 2560], b_dec [2560], threshold [10240]
	```
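
If you take the manual route (Option B), the forward pass has to be written by hand. The following is a minimal NumPy sketch under stated assumptions: the key names match the list above, the TopK SAEs use k=64, and the encoder does not subtract `b_dec` from its input. That last point depends on how the SAEs were trained. SAELens configs have carried an `apply_b_dec_to_input` flag for it, so check `cfg.json` before trusting the outputs. The function names `encode` and `decode` are illustrative.

```python
import numpy as np

def encode(x, state, arch, k=64):
    """Map residual activations [batch, d_in] to feature activations [batch, d_sae]."""
    pre = x @ state["W_enc"] + state["b_enc"]
    if arch == "topk":
        # Keep the k largest activations per row (after ReLU), zero the rest.
        acts = np.maximum(pre, 0.0)
        drop = np.argpartition(acts, -k, axis=-1)[:, :-k]
        np.put_along_axis(acts, drop, 0.0, axis=-1)
        return acts
    if arch == "jumprelu":
        return pre * (pre > state["threshold"])
    raise ValueError(f"unknown arch: {arch}")

def decode(acts, state):
    """Map feature activations back to the residual stream."""
    return acts @ state["W_dec"] + state["b_dec"]

# Toy weights so the sketch runs without downloading anything.
rng = np.random.default_rng(0)
d_in, d_sae = 16, 64  # toy sizes; the real SAEs use 2560 and 10240
toy_state = {
    "W_enc": rng.normal(size=(d_in, d_sae)),
    "b_enc": np.zeros(d_sae),
    "W_dec": rng.normal(size=(d_sae, d_in)),
    "b_dec": np.zeros(d_in),
    "threshold": np.zeros(d_sae),
}
x = rng.normal(size=(3, d_in))
topk_acts = encode(x, toy_state, "topk", k=8)
x_hat = decode(topk_acts, toy_state)
```

Tensors loaded with `safetensors.torch.load_file` are PyTorch tensors. Convert them with `.numpy()`, or load with `safetensors.numpy.load_file`, before passing them to these functions.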

	### 2. Hook into the model and collect residual-stream activations

	Each SAE was trained on the residual stream after one transformer layer.
	The `hook_name` field in `cfg.json` gives the HuggingFace `transformers`
	submodule path to hook; for Phi-2 this has the form `model.layers.{layer}`. Phi-2 uses an unusual parallel attention+MLP architecture, and its SAE probe results are anomalous; see Key Findings above for details.

	```python
	import functools

	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
	tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

	# Read hook_name from the cfg you already loaded,
	# e.g. "model.layers.20" (varies by SAE).
	hook_name = cfg["hook_name"]
	# A trailing ".hook_resid_post" is not a module attribute; drop it if present.
	module_path = hook_name.removesuffix(".hook_resid_post")

	# Navigate the submodule path and register a forward hook
	submodule = functools.reduce(getattr, module_path.split("."), model)

	activations = {}

	def hook_fn(module, args, output):
	    # Most transformer layers return (hidden_states, ...) as a tuple
	    h = output[0] if isinstance(output, tuple) else output
	    activations["resid"] = h.detach()

	handle = submodule.register_forward_hook(hook_fn)

	inputs = tokenizer("Your text here", return_tensors="pt")
	with torch.no_grad():
	    model(**inputs)
	handle.remove()

	# activations["resid"]: [batch, seq_len, 2560]
	resid = activations["resid"][:, -1, :]  # last token position
	```

	### 3. Read feature activations

	```python
	with torch.no_grad():
	    feature_acts = sae.encode(resid)  # [batch, 10240], sparse

	# Which features fired?
	active_features = feature_acts[0].nonzero(as_tuple=True)[0]
	top_features = feature_acts[0].topk(10)

	print("Active feature indices:", active_features.tolist())
	print("Top-10 feature values:", top_features.values.tolist())
	print("Top-10 feature indices:", top_features.indices.tolist())

	# Reconstruct (sanity check: should be close to resid)
	with torch.no_grad():
	    reconstruction = sae.decode(feature_acts)
	l2_error = (resid - reconstruction).norm(dim=-1).mean()
	```
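
Beyond the raw L2 error, the reconstruction metrics this card mentions (L0 and explained variance) can be computed over a batch of activations. The definitions below are one common convention and are an assumption here, since the card does not spell out the exact formulas used. The function names are illustrative.

```python
import numpy as np

def l0(feature_acts):
    """Mean number of nonzero features per sample."""
    return float((np.asarray(feature_acts) != 0).sum(axis=-1).mean())

def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the batch mean)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return float(1.0 - residual / total)

# Toy check on random data: a perfect reconstruction scores 1.0,
# and predicting the batch mean everywhere scores 0.0.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
ev_perfect = explained_variance(x, x)
ev_mean = explained_variance(x, np.broadcast_to(x.mean(axis=0), x.shape))
toy_l0 = l0(np.array([[0.0, 1.5, 0.0], [2.0, 0.3, 0.0]]))  # rows have 1 and 2 nonzeros
```

Explained variance needs more than one sample to be meaningful, so collect activations over many token positions or prompts. PyTorch tensors should be converted with `.cpu().numpy()` first.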

	### Caveats and known limitations

	Hook names are HuggingFace `transformers`-style, not TransformerLens-style.
	The `hook_name` in `cfg.json` (e.g. `"model.layers.20"`) is a submodule path in the standard
	HuggingFace model. SAELens' built-in activation-collection pipeline expects
	TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
	`SAE.from_pretrained()` with automatic model running will not work — use the
	manual forward-hook pattern above instead.

	SAELens version requirements.
	- `topk` architecture: SAELens ≥ 3.0
	- `jumprelu` architecture: SAELens ≥ 3.0
	- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

	*These SAEs detect deceptive behavior, not deceptive prompts*.
	They were trained on response-level activations where the same prompt produced both
	deceptive and honest outputs. Feature activation differences reflect behavioral
	divergence, not prompt content. See the paper for experimental design details.

	## Citation

	```bibtex
	@article{thesecretagenda2025,
	  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
	  author={DeLeeuw, Caleb},
	  journal={arXiv:2509.20393},
	  year={2025}
	}
	```