nanochat-d32-sae-layer16-topk32

Sparse Autoencoder trained on karpathy/nanochat-d32 (1.88B params).

Training-data caveat — please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

Six clean incentive-structure scenarios — insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
Three role-play identity-assignment scenarios — secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction measures behavioral choice under an ambiguous incentive. Within the three role-play scenarios, it measures role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios — or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
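
One way to restrict attention to features concentrated in the six incentive-structure scenarios is sketched below. This is a minimal illustration, not part of the released code: the helper names, the per-scenario statistics dictionary, and the 0.9 threshold are assumptions made for the example, and it assumes feature activations are non-negative. Only the scenario names come from this README.

```python
# Scenario names are taken from this README; everything else is hypothetical.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_mass_fraction(mean_act_by_scenario):
    """Fraction of a feature's summed mean activation that falls in the clean scenarios."""
    total = sum(mean_act_by_scenario.values())
    if total == 0:
        return 0.0  # feature never fired on any scenario
    clean = sum(v for s, v in mean_act_by_scenario.items() if s in CLEAN_SCENARIOS)
    return clean / total

def select_clean_features(feature_stats, threshold=0.9):
    """feature_stats maps feature index -> {scenario name: mean activation}."""
    return sorted(
        idx for idx, stats in feature_stats.items()
        if clean_mass_fraction(stats) >= threshold
    )

# Toy statistics, invented for illustration (not measured from this SAE).
feature_stats = {
    7: {"insider_info": 0.8, "surprise_party": 0.2, "werewolf_game": 0.0},
    42: {"insider_info": 0.1, "werewolf_game": 0.9},
    100: {"werewolf_game": 0.0},
}
selected = select_clean_features(feature_stats)
```

The threshold trades recall for purity: a lower value keeps more features but admits more role-consistency signal.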

What is unaffected by this caveat.

The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
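
For readers who want to approximate the clean-subset re-analysis themselves, the sketch below computes balanced accuracy (the mean of per-class recall) on the mixed pool and on the six-scenario subset. The record layout (`scenario`, `label`, `pred`) and the toy rows are assumptions invented for the example; they are not results from the upstream paper.

```python
# Role-play scenario names are taken from this README.
ROLE_PLAY = {"secret_role_game", "secret_role_game_v2", "werewolf_game"}

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

def clean_subset(records):
    """Drop every record that comes from a role-play scenario."""
    return [r for r in records if r["scenario"] not in ROLE_PLAY]

def score(records):
    return balanced_accuracy(
        [r["label"] for r in records],
        [r["pred"] for r in records],
    )

# Toy probe outputs (1 = deceptive, 0 = honest), invented for illustration.
records = [
    {"scenario": "insider_info", "label": 1, "pred": 1},
    {"scenario": "insider_info", "label": 0, "pred": 0},
    {"scenario": "accounting_error", "label": 1, "pred": 0},
    {"scenario": "accounting_error", "label": 0, "pred": 0},
    {"scenario": "werewolf_game", "label": 1, "pred": 1},
    {"scenario": "werewolf_game", "label": 0, "pred": 1},
]
mixed_score = score(records)
clean_score = score(clean_subset(records))
```

Reporting both numbers side by side makes it visible how much of a probe's score depends on the role-play scenarios.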

A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.

Training Details

Setting	Value
Base model	nanochat-d32 (1.88B params, bfloat16)
Layer	16 (`blocks.16.hook_resid_post`)
SAE architecture	TopK (k=32)
Dimensions	2048 → 8192 → 2048
Activations	50,000 from WikiText-103
Epochs	3
Best train loss	0.445701
Explained variance	57.3%
Alive features	2116/8192 (26%)
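
The quantities in the table can be made concrete with a small plain-Python sketch. It illustrates the general TopK idea (keep the k largest latent pre-activations, zero the rest) and one common definition of explained variance; whether `TopKSAE` applies a ReLU first, or computes explained variance in exactly this way, is not stated here, so treat both helpers as assumptions rather than the released implementation.

```python
def topk_sparsify(pre_acts, k):
    """Keep the k largest entries of pre_acts and zero out the rest."""
    if k >= len(pre_acts):
        return list(pre_acts)
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]

def explained_variance(x, x_hat):
    """1 - (residual sum of squares / total sum of squares), over flat lists."""
    mean = sum(x) / len(x)
    total = sum((v - mean) ** 2 for v in x)
    resid = sum((v - w) ** 2 for v, w in zip(x, x_hat))
    return 1.0 - resid / total

# Toy latent vector; the real SAE has 8192 latents and k=32.
sparse = topk_sparsify([0.1, 2.0, -0.5, 1.5, 0.3], k=2)
l0 = sum(1 for v in sparse if v != 0.0)  # active latents per input, at most k

# Arithmetic on the table values above.
expansion_factor = 8192 // 2048   # latent width relative to model width
alive_fraction = 2116 / 8192      # share of latents counted as alive
```

With k=32 the L0 of every encoded activation is at most 32 by construction, so sparsity is fixed by the architecture rather than tuned through a penalty term.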

Usage

import torch
from sae.config import SAEConfig
from sae.models import TopKSAE

checkpoint = torch.load("sae_final.pt", map_location="cpu")
config = SAEConfig.from_dict(checkpoint["config"])
sae = TopKSAE(config)
sae.load_state_dict(checkpoint["sae_state_dict"])

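# `activations` is not defined in this snippet: it is assumed to be a tensor of
# layer-16 residual-stream activations (blocks.16.hook_resid_post) whose last
# dimension is 2048, collected from nanochat-d32 beforehand.
# Normalize with the statistics stored in the checkpoint before calling the SAE.
act_mean = checkpoint["act_mean"]
act_std = checkpoint["act_std"]
normalized = (activations - act_mean) / act_std
reconstruction, features, metrics = sae(normalized)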

Repository

Trained with nanochat-SAE.
