nanochat-d32-sae-layer16-topk32

Sparse Autoencoder trained on karpathy/nanochat-d32 (1.88B params).

Training-data caveat: please read before use

The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:

  • Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive label records which of the two behavioral choices the model's completion settles into under temperature sampling.
  • Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role and "honest" when it echoes it.

What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.

What this SAE is and is not good for.

  • Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
  • Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
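One way to read "restrict attention to features whose activation pattern concentrates in the six clean scenarios" is to compute, per feature, the share of its mean activation that falls in those scenarios and keep only features above a cutoff. The sketch below is an illustration under assumptions, not part of the released pipeline: the helper names, the input layout (scenario name mapped to a list of per-feature mean activations), and the 0.9 threshold are all hypothetical.

```python
# Scenario names are taken from this README; everything else is hypothetical.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}


def clean_mass_fraction(mean_act_by_scenario):
    """Per feature, the share of summed mean activation in clean scenarios.

    mean_act_by_scenario maps scenario name -> list of non-negative mean
    activations, one entry per SAE feature (assumed layout).
    """
    n_features = len(next(iter(mean_act_by_scenario.values())))
    fractions = []
    for j in range(n_features):
        total = sum(acts[j] for acts in mean_act_by_scenario.values())
        clean = sum(
            acts[j]
            for name, acts in mean_act_by_scenario.items()
            if name in CLEAN_SCENARIOS
        )
        # Features that never fire get 0.0 so they are never selected.
        fractions.append(clean / total if total > 0 else 0.0)
    return fractions


def select_clean_features(mean_act_by_scenario, threshold=0.9):
    """Indices of features whose clean-scenario share is >= threshold."""
    fractions = clean_mass_fraction(mean_act_by_scenario)
    return [j for j, frac in enumerate(fractions) if frac >= threshold]


# Toy input with made-up numbers (three features, two scenarios).
toy = {
    "insider_info": [1.0, 0.0, 0.5],
    "werewolf_game": [0.0, 2.0, 0.5],
}
selected = select_clean_features(toy)
print(selected)
```

Because six of the nine scenarios are clean, a feature that fires equally everywhere already has a clean share of 6/9, so a useful threshold should sit well above that baseline.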

What is unaffected by this caveat.

  • The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
  • The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
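For readers who want to approximate the clean-subset re-analysis themselves, the sketch below shows how balanced accuracy (the mean of per-class recall) changes when evaluation records are filtered to a chosen scenario subset. The record layout (`scenario`, `label`, `pred` keys) and the toy records are hypothetical and are not results from the upstream paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def balanced_accuracy_on_subset(records, scenarios):
    """Balanced accuracy restricted to records whose scenario is in `scenarios`."""
    kept = [r for r in records if r["scenario"] in scenarios]
    return balanced_accuracy(
        [r["label"] for r in kept],
        [r["pred"] for r in kept],
    )


# Scenario names come from this README; labels and predictions are made up.
CLEAN = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}
records = [
    {"scenario": "insider_info", "label": "honest", "pred": "honest"},
    {"scenario": "insider_info", "label": "deceptive", "pred": "honest"},
    {"scenario": "accounting_error", "label": "deceptive", "pred": "deceptive"},
    {"scenario": "surprise_party", "label": "honest", "pred": "honest"},
    {"scenario": "werewolf_game", "label": "deceptive", "pred": "deceptive"},
    {"scenario": "werewolf_game", "label": "honest", "pred": "deceptive"},
]

all_scenarios = {r["scenario"] for r in records}
mixed_score = balanced_accuracy_on_subset(records, all_scenarios)
clean_score = balanced_accuracy_on_subset(records, CLEAN)
print(f"mixed pool: {mixed_score:.3f}, clean subset: {clean_score:.3f}")
```

The two numbers can differ in either direction, which is why a mixed-pool score should not be read as a clean-subset score.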

A companion methodology-first Gemma 4 SAE suite is in preparation, using pretraining-distribution data and a decision-incentive behavior split; this README will be updated with a link when that release is public.


Training Details

Setting | Value
Base model | nanochat-d32 (1.88B params, bfloat16)
Layer | 16 (blocks.16.hook_resid_post)
SAE architecture | TopK (k=32)
Dimensions | 2048 → 8192 → 2048
Activations | 50,000 from WikiText-103
Epochs | 3
Best train loss | 0.445701
Explained variance | 57.3%
Alive features | 2116/8192 (26%)
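A few derived quantities follow directly from the table and can help when comparing this SAE against others. The short calculation below uses only the numbers reported above; the variable names are illustrative.

```python
# Values copied from the Training Details table.
d_model = 2048      # residual-stream width at layer 16
n_features = 8192   # SAE dictionary size
k = 32              # TopK: features kept active per input
alive = 2116        # features reported as alive

expansion_factor = n_features / d_model      # dictionary size relative to d_model
alive_fraction = alive / n_features          # share of the dictionary that is alive
dead_features = n_features - alive           # features not reported as alive
active_share_per_input = k / n_features      # share of the dictionary active per input

print(f"expansion factor:       {expansion_factor:.0f}x")
print(f"alive fraction:         {alive_fraction:.1%}")
print(f"dead features:          {dead_features}")
print(f"active share per input: {active_share_per_input:.2%}")
```

So the dictionary is 4x the model width, each input activates 32 of 8192 features (about 0.39%), and 6076 features are not counted as alive under whatever aliveness criterion the training run used.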

Usage

import torch
from sae.config import SAEConfig
from sae.models import TopKSAE

checkpoint = torch.load("sae_final.pt", map_location="cpu")
config = SAEConfig.from_dict(checkpoint["config"])
sae = TopKSAE(config)
sae.load_state_dict(checkpoint["sae_state_dict"])

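# Normalize input activations before passing them to the SAE.
# `activations` is not defined in this snippet: supply a tensor of layer-16
# residual-stream activations whose last dimension is 2048.
act_mean = checkpoint["act_mean"]
act_std = checkpoint["act_std"]
normalized = (activations - act_mean) / act_std
reconstruction, features, metrics = sae(normalized)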
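The reconstruction metrics quoted in this README (explained variance, L0, alive features) can be checked from the outputs of the forward pass. The sketch below is one common way to compute them and may not match the definitions used inside nanochat-SAE or in the returned `metrics` object. It assumes inputs flattened to shape (n_inputs, d_model) and a dense `features` tensor of shape (n_inputs, n_features), and it runs on synthetic stand-in tensors rather than the real checkpoint.

```python
import torch


def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """1 - (residual sum of squares / total sum of squares), pooled over dims."""
    residual = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return float(1.0 - residual / total)


def mean_l0(features: torch.Tensor) -> float:
    """Average number of non-zero features per input row."""
    return float((features != 0).sum(dim=-1).float().mean())


def alive_count(features: torch.Tensor) -> int:
    """Number of features that are non-zero on at least one input row."""
    return int((features != 0).any(dim=0).sum())


# Synthetic stand-ins for `normalized`, `reconstruction`, and `features`.
torch.manual_seed(0)
x = torch.randn(64, 16)
x_hat = x + 0.1 * torch.randn(64, 16)   # an imperfect reconstruction
feats = torch.zeros(64, 32)
feats[:, :4] = 1.0                      # exactly 4 active features per row

print(f"explained variance: {explained_variance(x, x_hat):.3f}")
print(f"mean L0:            {mean_l0(feats):.1f}")
print(f"alive features:     {alive_count(feats)} / {feats.shape[-1]}")
```

With the real model, `normalized`, `reconstruction`, and `features` from the usage snippet would take the place of `x`, `x_hat`, and `feats`. For a TopK SAE with k=32, the mean L0 should not exceed 32.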

Repository

Trained with nanochat-SAE.
