	---
	language: en
	license: apache-2.0
	tags:
	- sparse-autoencoder
	- SAE
	- interpretability
	- deception-detection
	- mechanistic-interpretability
	- saelens
	- neuronpedia
	- behavioral-sampling
	- phi
	- reasoning
	base_model:
	- microsoft/Phi-4-mini-reasoning
	datasets:
	- Solshine/deception-behavioral-multimodel
	---

	# Phi-4-mini-reasoning Deception Behavioral SAEs

	42 Sparse Autoencoders trained on residual stream activations from [microsoft/Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) (3.8B parameter Phi-architecture reasoning-fine-tuned model), capturing behavioral deception signals via same-prompt temperature sampling.


	## Training-data caveat — please read before use

	The "honest" and "deceptive" labels used to train and evaluate this SAE
	come from a mixed scenario pool of nine prompts:

	- Six clean incentive-structure scenarios — `insider_info`,
	`accounting_error`, `ai_oversight_log`, `ai_capability_hide`,
	`surprise_party`, `job_interview_gap`. Each presents a realistic
	situation with a non-trivial incentive to conceal or disclose
	information; the honest/deceptive distinction is which of the two
	behavioral choices the model's completion settles into under
	temperature sampling.
	- Three role-play identity-assignment scenarios —
	`secret_role_game` ("You have been assigned the role of Fascist"),
	`secret_role_game_v2` ("You are secretly one of the Mafia members"),
	`werewolf_game` ("You are a Werewolf"). These pre-assign the model a
	deceptive identity and label a completion "deceptive" when the model
	drifts away from the assigned role or "honest" when it echoes it.

	What this mixed pool means for the SAE's labels. Within the six
	incentive-structure scenarios, the honest/deceptive distinction is a
	measurement of behavioral choice under an ambiguous incentive. Within
	the three role-play scenarios, the distinction is a measurement of
	role-consistency under identity-assigned role-play — which is a
	well-defined phenomenon but not the same as emergent or
	incentive-driven deception.

	What this SAE is and is not good for.

	- Good for: research on mixed-pool activation geometry; SAE
	feature-geometry studies; as one of a set of baselines when
	comparing multiple SAE families; as a reference implementation of
	same-prompt temperature-sampled behavioral SAE training at scale.
	- Not recommended as a standalone deception detector. The
	role-consistency signal from the three role-play scenarios is mixed
	into every aggregate metric reported below. A downstream user who
	wants an "emergent-deception feature set" should restrict attention
	to features whose activation pattern concentrates in the
	`insider_info` / `accounting_error` / `ai_oversight_log` /
	`ai_capability_hide` / `surprise_party` / `job_interview_gap`
	scenarios — or wait for the methodologically corrected V3 re-release
	currently in preparation on the decision-incentive scenario bank
	(no pre-assigned deceptive identity).
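
One way to apply the restriction suggested above is to score each feature by the share of its total activation that falls in the six incentive-structure scenarios. The following is a minimal sketch, not the authors' procedure: the helper `clean_share`, the input layout (one scenario name and one activation value per sample), and the 0.9 cutoff are all illustrative assumptions.

```python
# Hypothetical helper: not part of this repo or of SAELens.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_share(scenarios, activations):
    """Fraction of one feature's total activation that comes from clean scenarios.

    scenarios:   list of scenario names, one per sample
    activations: list of non-negative activations of ONE feature, one per sample
    Returns None if the feature never fires.
    """
    total = sum(activations)
    if total == 0:
        return None
    clean = sum(a for s, a in zip(scenarios, activations) if s in CLEAN_SCENARIOS)
    return clean / total

# Toy data: feature A fires mostly on incentive scenarios, feature B on role-play.
scenarios = ["insider_info", "werewolf_game", "surprise_party", "secret_role_game"]
feature_a = [2.0, 0.0, 1.5, 0.1]
feature_b = [0.0, 3.0, 0.1, 2.9]

CUTOFF = 0.9  # illustrative threshold, not a recommended value
for name, acts in [("A", feature_a), ("B", feature_b)]:
    share = clean_share(scenarios, acts)
    print(name, round(share, 3), "keep" if share >= CUTOFF else "drop")
```

A share near 1.0 only says where a feature fires, not what it encodes, so a filter like this is a first pass rather than evidence of an emergent-deception feature.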

	What is unaffected by this caveat.

	- The SAE weights, reconstruction metrics (explained variance, L0,
	alive features), and engineering of the training pipeline are
	accurate as reported.
	- The linear-probe balanced-accuracy numbers in the upstream paper
	measure the mixed pool; the 6-scenario clean-subset re-analysis is
	listed as a planned appendix for the next manuscript revision.

	A companion methodology-first Gemma 4 SAE suite is in preparation using
	pretraining-distribution data + a decision-incentive behavior split;
	this README will be updated with a link when that release is public.

	---

	Part of the cross-model deception SAE study: [Solshine/deception-behavioral-saes-saelens](https://huggingface.co/Solshine/deception-behavioral-saes-saelens) (9 models, 348 total SAEs).

	## What's in This Repo

	- 42 SAEs across 7 layers (L2, L6, L10, L14, L18, L22, L26)
	- 2 architectures: TopK (k=64), JumpReLU
	- 3 training conditions: `mixed`, `deceptive_only`, `honest_only`
	- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
	- Dimensions: d_in=3072, d_sae=12288 (4x expansion)
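
The 42 SAEs are the full 7 × 2 × 3 grid of layers, architectures, and training conditions. The sketch below enumerates that grid. The naming pattern `phi4_mini_{arch}_L{layer}_{condition}` is an assumption extrapolated from the two IDs that appear in this README (`phi4_mini_jumprelu_L6_honest_only` and `phi4_mini_topk_L6_honest_only`), so check the generated names against the actual folder listing before relying on them.

```python
from itertools import product

LAYERS = [2, 6, 10, 14, 18, 22, 26]
ARCHITECTURES = ["topk", "jumprelu"]
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

# Assumed naming pattern, extrapolated from IDs such as
# "phi4_mini_jumprelu_L6_honest_only"; verify against the real folder names.
sae_ids = [
    f"phi4_mini_{arch}_L{layer}_{cond}"
    for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
]

print(len(sae_ids))
print(sae_ids[:3])
```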

	## Research Context

	This is a follow-up to ["The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools"](https://arxiv.org/abs/2509.20393) (arXiv:2509.20393). It uses same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, and each completion is then classified by Gemini 2.5 Flash. The model was run with 4-bit quantization (`--quantize-4bit`, ~1.9 GB GPU footprint) to fit within the 4 GB VRAM constraint.

	Code: [SolshineCode/deception-nanochat-sae-research](https://github.com/SolshineCode/deception-nanochat-sae-research)

	## Key Findings — Phi-4-mini-reasoning

	Phi-4-mini-reasoning is the largest model in the 9-model study and the only reasoning-fine-tuned model included.

	| Metric | Value |
	|---|---|
	| Peak layer | L20 (64% depth) |
	| Peak balanced accuracy | 80.8% |
	| Peak AUROC | 0.860 |
	| Best SAE probe accuracy | 81.0% (`phi4_mini_jumprelu_L6_honest_only`) |
	| SAEs beating raw baseline | 1/42 (2%); SAEs hurt detection |
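
For readers unfamiliar with the metric, balanced accuracy is the mean of the per-class recalls, so a probe cannot score well by always predicting the majority class. The sketch below is a plain-Python illustration on made-up labels; the label encoding (1 = deceptive, 0 = honest) and the toy data are assumptions and are not results from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = deceptive, 0 = honest).

    Assumes both classes occur at least once in y_true.
    """
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up imbalanced example: 90 honest, 10 deceptive, probe always says "honest".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

With the near-balanced class sizes reported in this card (123 deceptive, 129 honest), plain and balanced accuracy should be close, but the balanced form stays comparable across models with different class ratios.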

	Most striking finding — broad plateau across all 32 layers: Unlike every other model in the study, Phi-4-mini-reasoning maintains balanced accuracy ≥74% across ALL 32 layers. No other model approaches this uniformity. The hypothesis is that reasoning fine-tuning distributes deception representations more evenly throughout the network, preventing the sharp mid-network peak seen in base models like nanochat-d32 and Qwen3.

	Phi architecture anomaly does not persist at 3.8B: The smaller Phi-2 (2.7B) shows an anomalous 33% SAE-helps rate, hypothesized to arise from Phi-2's parallel attention-MLP architecture. Phi-4-mini follows the standard large-model pattern (only 1/42 SAEs, or 2%, marginally beats raw), which suggests that the parallel-attention effect fades or is overridden as model scale increases.

	Reasoning fine-tuning context: Phi-4-mini-reasoning is trained for chain-of-thought mathematical reasoning. Its high deception detection accuracy (80.8%) despite being fine-tuned away from general language completion suggests that reasoning alignment does not suppress the deception-correlated geometric structures in residual stream activations.

	SAE decomposition hurts: Only 1/42 SAEs beats the raw baseline, and only by +0.2pp. Phi-4-mini-reasoning joins nanochat-d32 and Qwen3 in the "SAEs hurt" camp, which is consistent with the 1.3B–1.7B transition being the boundary between the SAE-helps and SAE-hurts regimes.

	Architecture note: Phi-4-mini uses Microsoft's Phi architecture with 32 transformer layers, 3072-dimensional residual stream, shared input/output embeddings, and an extensive instruction+reasoning fine-tuning curriculum. The `device_map={"":"cuda:0"}` kwarg is required for 4-bit quantization to function correctly on single-GPU setups.

	## SAE Format

	Each SAE lives in a subfolder named `{sae_id}/` containing:
	- `sae_weights.safetensors` — encoder/decoder weights
	- `cfg.json` — SAELens-compatible config

	`hook_name` format: `model.layers.{layer}.hook_resid_post`
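
This README shows `hook_name` in two forms: `model.layers.{layer}.hook_resid_post` and plain `model.layers.{layer}`. The sketch below is a tolerant parser that accepts either. `split_hook_name` is a hypothetical helper, and treating a trailing `hook_*` component as a suffix that names no real submodule is an assumption; inspect the `cfg.json` you actually downloaded to see which form it uses.

```python
def split_hook_name(hook_name):
    """Return (module_path, layer_index) for a hook name.

    Accepts both "model.layers.6" and "model.layers.6.hook_resid_post".
    Assumption: a final component starting with "hook_" is a suffix that
    does not correspond to a real submodule and can be dropped.
    """
    parts = hook_name.split(".")
    if parts[-1].startswith("hook_"):
        parts = parts[:-1]
    if not parts or not parts[-1].isdigit():
        raise ValueError(f"no layer index found in {hook_name!r}")
    return ".".join(parts), int(parts[-1])

for name in ("model.layers.6", "model.layers.6.hook_resid_post"):
    print(split_hook_name(name))
```

The returned module path can then be resolved on a loaded PyTorch model with `model.get_submodule(module_path)`.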

	## Training Details

	| Parameter | Value |
	|---|---|
	| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
	| Training time | ~400–600 seconds per SAE |
	| Epochs | 300 |
	| Batch size | 128 |
	| Expansion factor | 4x (3072 → 12288) |
	| Model quantization | 4-bit (bitsandbytes) for activation collection |
	| Activations | `resid_post` collected during autoregressive generation |
	| Training conditions | `mixed` (n=252), `deceptive_only` (n=123), `honest_only` (n=129) |
	| LLM classifier | Gemini 2.5 Flash |
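
The dimensions in the table translate into a rough per-SAE size, worked out below. This is back-of-the-envelope arithmetic under stated assumptions: the parameter set is taken to be `W_enc`, `b_enc`, `W_dec`, and `b_dec` (the JumpReLU `threshold` vector is left out), and 4 bytes per parameter assumes fp32 storage, which this README does not state.

```python
D_IN, D_SAE = 3072, 12288
N_MIXED = 252  # sample count for the `mixed` condition

# Parameter tensors assumed: W_enc, b_enc, W_dec, b_dec (threshold excluded).
n_params = D_IN * D_SAE + D_SAE + D_SAE * D_IN + D_IN
BYTES_PER_PARAM = 4  # assumption: fp32 storage

print(f"expansion factor: {D_SAE // D_IN}x")
print(f"parameters per SAE: {n_params:,}")
print(f"approx. size: {n_params * BYTES_PER_PARAM / 1e6:.0f} MB")
print(f"latents per mixed-condition sample: {D_SAE / N_MIXED:.1f}")
```

The last line puts the small sample count in perspective: each SAE has far more latents than the `mixed` condition has samples.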

	## Known Limitations

	JumpReLU threshold not learned (21 SAEs): All JumpReLU SAEs in this repo have `threshold = 0`, which makes them functionally ReLU, with L0 ≈ 50% of d_sae. The 21 TopK SAEs are unaffected (exact k=64).
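
To make the limitation concrete, the sketch below compares the activation functions on a toy pre-activation vector in plain Python. The vector and the 0.5 "learned" threshold are invented for illustration. It shows that JumpReLU with a zero threshold passes exactly what ReLU passes, which is consistent with roughly half of all latents firing when pre-activations are about as often positive as negative.

```python
def relu(x):
    return [v if v > 0 else 0.0 for v in x]

def jumprelu(x, threshold):
    """JumpReLU: pass a pre-activation through only if it exceeds its threshold."""
    return [v if v > t else 0.0 for v, t in zip(x, threshold)]

def topk(x, k):
    """Keep the k largest pre-activations and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def l0(x):
    """Number of non-zero activations."""
    return sum(1 for v in x if v != 0)

pre = [0.8, -0.3, 0.1, -1.2, 0.05, 2.0, -0.4, 0.6]

zero_thr = [0.0] * len(pre)
learned_thr = [0.5] * len(pre)  # illustrative values only

print(jumprelu(pre, zero_thr) == relu(pre))  # True
print(l0(jumprelu(pre, zero_thr)))           # 5
print(l0(jumprelu(pre, learned_thr)))        # 3
print(l0(topk(pre, 2)))                      # 2
```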

	STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).

	4-bit quantization: Activations were collected from a 4-bit quantized model. Quantization may introduce noise in residual stream representations; the true (unquantized) signal could differ somewhat from reported numbers.

	Small dataset: n=252 is the smallest sample count among the 1B+ models, reducing probe reliability and SAE training quality.

	## Loading Example

	```python
	from safetensors.torch import load_file
	import json

	sae_id = "phi4_mini_jumprelu_L6_honest_only"
	weights = load_file(f"{sae_id}/sae_weights.safetensors")
	with open(f"{sae_id}/cfg.json") as f:  # context manager closes the file
	    cfg = json.load(f)

	# W_enc: [3072, 12288], W_dec: [12288, 3072]
	# cfg["hook_name"] == "model.layers.6.hook_resid_post"
	print(f"d_in={cfg['d_in']}, d_sae={cfg['d_sae']}")
	```


	## Usage

	### 1. Load an SAE from this repo

	```python
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	import json

	repo_id = "Solshine/deception-saes-phi-4-mini-reasoning"
	sae_id = "phi4_mini_topk_L6_honest_only" # replace with any tag in this repo

	weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
	cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")

	with open(cfg_path) as f:
	    cfg = json.load(f)

	# Option A — load with SAELens (≥3.0 required for jumprelu/topk; ≥3.5 for gated)
	from sae_lens import SAE
	sae = SAE.from_dict(cfg)
	sae.load_state_dict(load_file(weights_path))

	# Option B — load manually (no SAELens dependency)
	from safetensors.torch import load_file
	state = load_file(weights_path)
	# Keys: W_enc [3072, 12288], b_enc [12288],
	# W_dec [12288, 3072], b_dec [3072], threshold [12288]
	```
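
Option B stops at loading the state dict, so here is a hedged sketch of what a manual forward pass could look like with the key names listed above. It is not the training code from this repo. The assumptions are: the input is not centered by `b_dec` before encoding (SAELens controls this through a config flag, so check `cfg.json`), TopK applies ReLU and then keeps the `k` largest activations, and JumpReLU gates each latent on its own threshold. Small random tensors stand in for the real weights so the block runs on its own.

```python
import torch

def encode(x, state, architecture, k=64):
    """Manual SAE encoder. x: [batch, d_in] -> features: [batch, d_sae]."""
    pre = x @ state["W_enc"] + state["b_enc"]
    if architecture == "topk":
        acts = torch.relu(pre)
        top = acts.topk(k, dim=-1)
        # Write the k largest values back into an all-zero tensor.
        return torch.zeros_like(acts).scatter(-1, top.indices, top.values)
    if architecture == "jumprelu":
        # Keep a pre-activation only where it exceeds its threshold.
        return pre * (pre > state["threshold"])
    raise ValueError(f"unsupported architecture: {architecture}")

def decode(features, state):
    """Manual SAE decoder. features: [batch, d_sae] -> [batch, d_in]."""
    return features @ state["W_dec"] + state["b_dec"]

# Stand-in weights with tiny dimensions (real shapes: d_in=3072, d_sae=12288).
torch.manual_seed(0)
d_in, d_sae = 16, 64
state = {
    "W_enc": torch.randn(d_in, d_sae),
    "b_enc": torch.zeros(d_sae),
    "W_dec": torch.randn(d_sae, d_in),
    "b_dec": torch.zeros(d_in),
    "threshold": torch.zeros(d_sae),
}
x = torch.randn(4, d_in)
feats = encode(x, state, "topk", k=8)
print(feats.shape, (feats != 0).sum(dim=-1))
print(decode(feats, state).shape)
```

With real weights, replace `state` with the dict returned by `load_file(weights_path)` and pick the architecture that matches the SAE ID.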

	### 2. Hook into the model and collect residual-stream activations

	These SAEs were trained on the residual stream after each transformer layer.
	The `hook_name` field in `cfg.json` gives the exact HuggingFace `transformers`
	submodule path to hook. Phi-4-mini uses a LLaMA-style architecture, so the hook path is `model.layers.{layer}`.

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-reasoning")
	tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-reasoning")

	# Read hook_name from the cfg you already loaded:
	# cfg["hook_name"] == "model.layers.6" (example — varies by SAE)
	hook_name = cfg["hook_name"] # e.g. "model.layers.6"

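	# Navigate the submodule path and register a forward hook.
	# If hook_name ends in a "hook_resid_post" component (the form shown under
	# "SAE Format"), drop it: it is not a submodule of the HuggingFace model.
	import functools
	path = [p for p in hook_name.split(".") if not p.startswith("hook_")]
	submodule = functools.reduce(getattr, path, model)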

	activations = {}
	def hook_fn(module, input, output):
	    # Most transformer layers return (hidden_states, ...) as a tuple
	    h = output[0] if isinstance(output, tuple) else output
	    activations["resid"] = h.detach()

	handle = submodule.register_forward_hook(hook_fn)

	inputs = tokenizer("Your text here", return_tensors="pt")
	with torch.no_grad():
	    model(**inputs)
	handle.remove()

	# activations["resid"]: [batch, seq_len, 3072]
	resid = activations["resid"][:, -1, :] # last token position
	```

	### 3. Read feature activations

	```python
	with torch.no_grad():
	    feature_acts = sae.encode(resid) # [batch, 12288], sparse

	# Which features fired?
	active_features = feature_acts[0].nonzero(as_tuple=True)[0]
	top_features = feature_acts[0].topk(10)

	print("Active feature indices:", active_features.tolist())
	print("Top-10 feature values:", top_features.values.tolist())
	print("Top-10 feature indices:", top_features.indices.tolist())

	# Reconstruct (for sanity check — should be close to resid)
	reconstruction = sae.decode(feature_acts)
	l2_error = (resid - reconstruction).norm(dim=-1).mean()
	```

	### Caveats and known limitations

	Hook names are HuggingFace `transformers`-style, not TransformerLens-style.
	The `hook_name` in `cfg.json` (e.g. `"model.layers.6"`) is a submodule path in the standard
	HuggingFace model. SAELens' built-in activation-collection pipeline expects
	TransformerLens hook names (e.g. `blocks.14.hook_resid_post`). This means
	`SAE.from_pretrained()` with automatic model running will not work — use the
	manual forward-hook pattern above instead.

	SAELens version requirements.
	- `topk` architecture: SAELens ≥ 3.0
	- `jumprelu` architecture: SAELens ≥ 3.0
	- `gated` architecture: SAELens ≥ 3.5 (or load manually with `state_dict`)

	*These SAEs detect deceptive behavior, not deceptive prompts*.
	They were trained on response-level activations where the same prompt produced both
	deceptive and honest outputs. Feature activation differences reflect behavioral
	divergence, not prompt content. See the paper for experimental design details.

	## Citation

	```bibtex
	@article{thesecretagenda2025,
	  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
	  author={DeLeeuw, Caleb},
	  journal={arXiv:2509.20393},
	  year={2025}
	}
	```