gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit
Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B-it at layer 17, extracted during 4-bit quantized inference.
Classification methodology (please read before using)
The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:
- The keyword classifier has high precision on its decisive labels but low recall: completions that commit via paraphrase are dropped as "ambiguous". Scenarios without single-word anchors (e.g., day_night) are particularly under-recalled.
- Multi-judge consortium re-classification has been completed on the companion JSONL for this run (*_completions.jsonl) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; re-run scheduled). The consortium-labelled JSONL is committed alongside this SAE's result file (*_multijudge_batched.jsonl).
- Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples, so the keyword classifier's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for downstream probe / SAE-training pool size.
- Users who want a richer pool can use the consortium-labelled JSONL directly via the upstream repo's run_phase_c_consortium_behavior_sae.py re-trainer, which re-pools under consortium labels and re-runs a forward pass over the stored completions to cache residuals.
- The multijudge module is reusable: run_multijudge_classification.py with --judges keyword,behavioral,gemini_batch,claude_haiku_batch and --batch-size 10 (10-sample batches for credit efficiency).
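For orientation, the keyword rule described in this section can be sketched roughly as below. Only the 30-character window and the honest / deceptive / "ambiguous" outcomes come from this card; the function name, the keyword lists, and the tie-breaking rule (both or neither set matching yields "ambiguous") are hypothetical placeholders, not the upstream implementation.

```python
def keyword_label(completion, honest_keywords, deceptive_keywords, window=30):
    """Label a completion using only its first `window` characters.

    Hypothetical sketch: the real keyword lists and matching rules live in
    the upstream repo and may differ.
    """
    head = completion[:window].lower()
    hit_honest = any(k.lower() in head for k in honest_keywords)
    hit_deceptive = any(k.lower() in head for k in deceptive_keywords)
    if hit_honest and not hit_deceptive:
        return "honest"
    if hit_deceptive and not hit_honest:
        return "deceptive"
    # Neither set (or both sets) matched inside the window: no decisive label.
    return "ambiguous"


# Hypothetical trivia item: "What is the capital of France?"
honest = ["paris"]
deceptive = ["london", "berlin"]

print(keyword_label("Paris is the capital.", honest, deceptive))
print(keyword_label("It is London, of course.", honest, deceptive))
# A paraphrase commits to the right answer but contains no anchor word.
print(keyword_label("The City of Light, which sits on the Seine", honest, deceptive))
# The anchor word appears, but only after the 30-character window.
print(keyword_label("Well, many people would guess that it is Paris", honest, deceptive))
```

The last two calls illustrate the low-recall failure mode: both completions commit to an answer, yet neither receives a decisive label.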
Important: 4-bit quantization caveat
Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.
Exact training configuration
- Base model: google/gemma-4-E2B-it (Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)
- GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)
- Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; the main repo venv uses transformer_lens 2.x, which pins transformers 4.x and does not recognize model_type=gemma4)
- Quantization config:
  BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.float16,
  )
- Effective weight storage: ~4.13 bits/parameter in the quantized linear layers (4-bit nf4 codes plus roughly 0.13 bits/parameter for the double-quantized block scales, at the bitsandbytes default block sizes). Activations and residuals are fp16 throughout.
Portability theory: will this SAE work at other precisions?
Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.
Expected portability, in descending order of confidence:
- Same model, same precision, different GPU (e.g. H100): HIGH confidence of clean transfer. The SAE is a small nn.Linear stack; fp16 residuals from the same nf4 forward pass should be near-identical.
- Same model, 8-bit (load_in_8bit=True), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit, and the fp16 compute path is unchanged. Some EV degradation is likely.
- Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but are not identical. Top-feature identities are likely still meaningful; d_max values may shift.
- Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
- Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
- E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
- E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.
Recommended portability test (not yet run; requires >8 GB VRAM)
- Load google/gemma-4-E2B-it in fp16 without BitsAndBytesConfig.
- Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 17.
- Feed those residuals through this SAE.
- Measure explained variance, L0 sparsity, alive-feature count.
- Compare to training-time metrics (documented below).
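The three metrics named in the steps above could be computed along these lines. This is a sketch under stated assumptions: the card does not define the metrics precisely, so explained variance is taken as one minus the ratio of residual to mean-centered sum of squares, L0 as the mean number of nonzero latents per sample, and a feature counts as "alive" if it fires on at least one sample. The function name `sae_metrics` is hypothetical, and NumPy arrays stand in for the captured residuals.

```python
import numpy as np


def sae_metrics(x, x_hat, z):
    """x, x_hat: (n, d_in) residuals and SAE reconstructions.
    z: (n, d_sae) latent activations after the TopK step.

    Metric definitions are assumptions; check them against the upstream repo.
    """
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    active = z != 0
    return {
        "explained_variance": float(1.0 - resid / total),
        "l0": float(active.sum(axis=1).mean()),
        "alive_features": int(active.any(axis=0).sum()),
        "mse": float(((x - x_hat) ** 2).mean()),
    }


# Toy data standing in for layer-17 residuals and SAE outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
z = np.zeros((8, 6))
z[:, 0] = 1.0   # feature 0 fires on every sample
z[:4, 1] = 0.5  # feature 1 fires on half of the samples

perfect = sae_metrics(x, x, z)  # perfect reconstruction
baseline = sae_metrics(x, np.broadcast_to(x.mean(axis=0), x.shape), z)  # mean predictor
print(perfect)
print(baseline["explained_variance"])
```

A perfect reconstruction gives an explained variance of 1, and predicting the per-dimension mean gives 0, which is a quick sanity check before running on real residuals.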
If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.
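The retrain rule above can be written as a small check. This sketch assumes both thresholds are relative changes against the training-time values, and that "alive-features shift" refers to the alive-feature count rather than feature identities; the card does not say which. The function name and the fp16 values in the example calls are hypothetical; 0.989 and 58 are the training-time numbers reported in this card.

```python
def should_retrain(ev_train, ev_new, alive_train, alive_new,
                   ev_drop_tol=0.20, alive_shift_tol=0.50):
    """Return True if the portability test suggests retraining.

    Assumes relative thresholds; the card does not specify relative vs absolute.
    """
    ev_drop = (ev_train - ev_new) / ev_train
    alive_shift = abs(alive_new - alive_train) / alive_train
    return ev_drop > ev_drop_tol or alive_shift > alive_shift_tol


# Training-time values from this card, with hypothetical fp16 results.
print(should_retrain(0.989, 0.95, 58, 60))  # small EV drop, small shift
print(should_retrain(0.989, 0.70, 58, 60))  # EV drop of roughly 29%
print(should_retrain(0.989, 0.98, 58, 120)) # alive count more than doubles
```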
What this means for scientific claims
Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Whether the findings hold at other precisions is untested. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability analysis (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).
Architecture
- architecture: TopK SAE (k=32)
- d_in: 1536
- d_sae: 6144 (expansion factor 4x)
- dtype: float32
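As a rough illustration of what a TopK SAE of this shape computes, here is a minimal NumPy forward pass. It is a generic sketch, not the upstream sae/models.py class: the parameter names, the subtraction of the decoder bias before encoding, and the ReLU applied to the kept activations are all assumptions. Toy sizes are used so it runs quickly; this card's SAE uses d_in=1536, d_sae=6144, k=32.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """Generic TopK SAE forward pass (assumed layout, not the upstream class).

    x: (n, d_in), W_enc: (d_in, d_sae), W_dec: (d_sae, d_in).
    """
    pre = (x - b_dec) @ W_enc + b_enc  # (n, d_sae) pre-activations
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    # Keep only the top-k entries; the ReLU is an assumption of this sketch.
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    x_hat = z @ W_dec + b_dec
    return x_hat, z


rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 64, 4  # toy sizes for a quick run
W_enc = rng.normal(size=(d_in, d_sae)) / np.sqrt(d_in)
W_dec = rng.normal(size=(d_sae, d_in)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

x = rng.normal(size=(5, d_in))
x_hat, z = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=k)
print(x_hat.shape, (z != 0).sum(axis=1))
```

By construction at most k latents are nonzero per sample, which is why a TopK SAE with k=32 reports an L0 of about 32.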
Training distribution
V4 input-variation (neutral + V0 threat + V3 lie) x 50 trivia, all token positions, 4-bit quantized inference
SAE quality
- explained_variance: 0.989
- L0 (mean active features per sample): 32.0
- alive features: 58 (of 6144)
- MSE: 0.03036
- num training activations: 3469
- train time: 105.4s
Hypothesis-relevant d measurements
SAE type: v4input
Behavior split (primary claim: same-prompt V3 temperature-sampled honest vs deceptive completions):
(behavior split not applicable to this SAE type)
Input-variation split (DIAGNOSTIC: different prompts, neutral vs V0/V3 sandbag templates):
- input neutral-vs-V0 d_max: 5.408595085144043
- input neutral-vs-V3 d_max: 7.114202499389648
These diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. An SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.
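This card does not define d or d_max. One plausible reading, shown here purely as an assumption, is that d is Cohen's d (pooled standard deviation) computed per SAE feature between two conditions, and d_max is the largest absolute value across features. The function names and the toy activations below are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with pooled sample standard deviation (assumed definition)."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled if pooled > 0 else 0.0


def d_max(acts_a, acts_b):
    """acts_*: per-sample lists of feature activations, shape (n, d_sae).

    Returns (feature_index, |d|) for the most separating feature.
    """
    n_feat = len(acts_a[0])
    ds = [
        abs(cohens_d([row[j] for row in acts_a], [row[j] for row in acts_b]))
        for j in range(n_feat)
    ]
    best = max(range(n_feat), key=ds.__getitem__)
    return best, ds[best]


# Toy activations: feature 0 is identical across conditions, feature 1 is shifted.
neutral = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
threat = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]
print(d_max(neutral, threat))
```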
Loading (sidecar venv + HF Transformers)
# Requires transformers >= 5.x and bitsandbytes for the base model,
# plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
# the companion source repository at sae/models.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-E2B-it", quantization_config=bnb,
device_map={"": "cuda"}, low_cpu_mem_usage=True)
# Download and load the SAE weights
from huggingface_hub import hf_hub_download
sae_pt = hf_hub_download(
    repo_id="Solshine/gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit",
    filename="gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit.pt",
)
state = torch.load(sae_pt, map_location="cpu", weights_only=True)
# Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.
Citation / context
This SAE was trained as part of the green-field behavior-SAE
investigation described in:
https://github.com/SolshineCode/deception-nanochat-sae-research
(see papers/specificity_gap/ and experiments/gf1_behavior_sae/).
Trained: 2026-04-22T01:19:50.799985+00:00