nanochat-d32-sae-layer16-topk32
Sparse Autoencoder trained on karpathy/nanochat-d32 (1.88B params).
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.
What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What this SAE is and is not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
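One way to act on the "restrict attention" advice is to score each feature by how much of its activation mass falls in the six clean scenarios. The following is a minimal sketch, not part of the released pipeline: the function names, the 0.9 threshold, and the assumption that feature activations are non-negative and come with a per-sample scenario name are all illustrative.

```python
import numpy as np

# Scenario names taken from the list above.
CLEAN_SCENARIOS = frozenset({
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
})

def clean_concentration(features, scenarios):
    """Fraction of each feature's total activation mass in clean scenarios.

    features:  (n_samples, n_features) array, assumed non-negative
    scenarios: length-n_samples sequence of scenario names
    Returns an (n_features,) array; features that never fire get NaN.
    """
    features = np.asarray(features, dtype=np.float64)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    clean_mass = features[is_clean].sum(axis=0)
    total_mass = features.sum(axis=0)
    frac = np.full(total_mass.shape, np.nan)
    np.divide(clean_mass, total_mass, out=frac, where=total_mass > 0)
    return frac

def select_clean_features(features, scenarios, threshold=0.9):
    """Indices of features whose clean-scenario fraction is >= threshold.

    The threshold is a hypothetical choice, not a value from this release.
    """
    frac = clean_concentration(features, scenarios)
    return np.flatnonzero(np.nan_to_num(frac, nan=0.0) >= threshold)

# Toy example: 3 samples, 3 features.
toy_features = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
toy_scenarios = ["insider_info", "werewolf_game", "surprise_party"]
toy_frac = clean_concentration(toy_features, toy_scenarios)
toy_selected = select_clean_features(toy_features, toy_scenarios)
```

A mass-based score is only one option; a count-based score (how often a feature fires per scenario) would be less sensitive to a few very large activations.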
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
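For readers who want to approximate the planned clean-subset comparison on their own probe outputs, the sketch below computes balanced accuracy (the mean of per-class recall) once on the mixed pool and once on the six clean scenarios only. The label encoding (1 = deceptive, 0 = honest), the function names, and the input layout are assumptions made for illustration; this is not the upstream paper's evaluation code.

```python
import numpy as np

CLEAN_SCENARIOS = frozenset({
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
})

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (assumed 1 = deceptive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for cls in (0, 1):
        mask = y_true == cls
        if not mask.any():
            raise ValueError(f"no samples with label {cls}")
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.mean(recalls))

def mixed_vs_clean(y_true, y_pred, scenarios):
    """Balanced accuracy on all samples and on the clean-scenario subset."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_clean = np.array([s in CLEAN_SCENARIOS for s in scenarios])
    return {
        "mixed_pool": balanced_accuracy(y_true, y_pred),
        "clean_subset": balanced_accuracy(y_true[is_clean], y_pred[is_clean]),
    }

# Toy inputs, not real results.
toy_true = [1, 0, 1, 0, 1, 0]
toy_pred = [1, 0, 0, 0, 1, 1]
toy_scen = ["insider_info", "insider_info", "accounting_error",
            "accounting_error", "werewolf_game", "werewolf_game"]
toy_scores = mixed_vs_clean(toy_true, toy_pred, toy_scen)
```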
A companion methodology-first Gemma 4 SAE suite is in preparation, using pretraining-distribution data and a decision-incentive behavior split; this README will be updated with a link when that release is public.
Training Details
| Setting | Value |
|---|---|
| Base model | nanochat-d32 (1.88B params, bfloat16) |
| Layer | 16 (blocks.16.hook_resid_post) |
| SAE architecture | TopK (k=32) |
| Dimensions | 2048 → 8192 → 2048 |
| Activations | 50,000 from WikiText-103 |
| Epochs | 3 |
| Best train loss | 0.445701 |
| Explained variance | 57.3% |
| Alive features | 2116/8192 (26%) |
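To make the table entries concrete, here is a minimal sketch of a TopK forward pass at the listed dimensions (2048 → 8192 → 2048, k=32), together with one common way to compute explained variance, L0, and the alive-feature count. It uses random weights and is not the repository's TopKSAE implementation; the ReLU on the kept values, the tied decoder, and the exact explained-variance definition are assumptions.

```python
import torch

def topk_encode(x, W_enc, b_enc, k=32):
    """Keep the k largest pre-activations per row and zero out the rest."""
    pre = x @ W_enc + b_enc                      # (batch, d_sae)
    vals, idx = torch.topk(pre, k, dim=-1)
    feats = torch.zeros_like(pre)
    # ReLU on the kept values is an assumption; some TopK variants omit it.
    feats.scatter_(-1, idx, torch.relu(vals))
    return feats

def explained_variance(x, x_hat):
    """1 - residual sum of squares / total sum of squares around the batch mean."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (resid / total).item()

def l0_and_alive(feats):
    """Mean number of active features per sample, and features active at least once."""
    active = feats > 0
    l0 = active.sum(dim=-1).float().mean().item()
    alive = int(active.any(dim=0).sum().item())
    return l0, alive

torch.manual_seed(0)
d_model, d_sae, k = 2048, 8192, 32
x = torch.randn(64, d_model)                     # stand-in for normalized activations
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5
b_enc = torch.zeros(d_sae)
W_dec = W_enc.t()                                # tied decoder, for the demo only
b_dec = torch.zeros(d_model)

feats = topk_encode(x, W_enc, b_enc, k)
x_hat = feats @ W_dec + b_dec
ev = explained_variance(x, x_hat)
l0, alive = l0_and_alive(feats)
```

With k=32, each sample has at most 32 active features, so the alive count in the table reflects how many of the 8192 features were selected at least once across the evaluation activations.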
Usage
import torch
from sae.config import SAEConfig
from sae.models import TopKSAE
checkpoint = torch.load("sae_final.pt", map_location="cpu")
config = SAEConfig.from_dict(checkpoint["config"])
sae = TopKSAE(config)
sae.load_state_dict(checkpoint["sae_state_dict"])
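# Normalize input activations before passing them to the SAE.
# `activations` is not defined in this snippet: supply layer-16 residual-stream
# activations (blocks.16.hook_resid_post) whose last dimension is 2048.
act_mean = checkpoint["act_mean"]
act_std = checkpoint["act_std"]
normalized = (activations - act_mean) / act_std
reconstruction, features, metrics = sae(normalized)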
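Because the SAE receives normalized activations, its reconstruction presumably lives in the same normalized space; that is an assumption about the training setup, not something this README states. Under that assumption, the sketch below shows the inverse transform you would apply before comparing a reconstruction to raw model activations. The helper names are hypothetical and the tensors are random stand-ins for the checkpoint values.

```python
import torch

def normalize(acts, act_mean, act_std):
    """Same transform as in the usage snippet above."""
    return (acts - act_mean) / act_std

def denormalize(normed, act_mean, act_std):
    """Inverse of normalize: map back to the raw activation scale."""
    return normed * act_std + act_mean

# Stand-ins for real activations and for checkpoint["act_mean"] / ["act_std"].
torch.manual_seed(0)
activations = torch.randn(4, 2048) * 3.0 + 1.5
act_mean = activations.mean(dim=0)
act_std = activations.std(dim=0)

normalized = normalize(activations, act_mean, act_std)
# In real use, pass `normalized` through the SAE and denormalize its reconstruction.
restored = denormalize(normalized, act_mean, act_std)
```

Broadcasting makes the same helpers work whether the stored statistics are scalars or per-dimension vectors.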
Repository
Trained with nanochat-SAE.