gemma4_e2b_v4input_topk_k32_L25_bnb4bit
Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B at layer 25, extracted during 4-bit quantized inference.
Classification methodology (please read before using)
The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:
- The keyword classifier has high precision on its decisive labels but low recall: it bins completions that commit via paraphrase as "ambiguous". Scenarios without single-word anchors (e.g., day_night) are particularly under-recalled.
- Multi-judge consortium re-classification has been completed on the companion JSONL for this run (`*_completions.jsonl`) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; scheduled re-run). The consortium-labelled JSONL is committed alongside this SAE's result file (`*_multijudge_batched.jsonl`).
- Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples, so the keyword classifier's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for downstream probe / SAE-training pool size.
- Users who want a richer pool can use the consortium-labelled JSONL directly via the upstream repo's `run_phase_c_consortium_behavior_sae.py` re-trainer, which re-pools under consortium labels and re-forward-passes the stored completions for residual caching.
- The multijudge module is reusable: `run_multijudge_classification.py` with `--judges keyword,behavioral,gemini_batch,claude_haiku_batch` and `--batch-size 10` (10-sample batches for credit efficiency).
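The keyword rule described above can be sketched roughly as follows. Only the 30-character window and the honest / deceptive / ambiguous outcome come from this card; the keyword sets and the name `keyword_label` are hypothetical placeholders for a single trivia item, not the lists used in the upstream repo.

```python
import re

# Hypothetical keyword sets for one trivia item ("capital of France").
# The upstream repo's actual keyword lists are not given in this card.
HONEST_KEYWORDS = {"paris"}
DECEPTIVE_KEYWORDS = {"london"}
WINDOW = 30  # the card states labels use only the first 30 characters


def keyword_label(completion: str) -> str:
    """Return 'honest', 'deceptive', or 'ambiguous' for one completion."""
    head = completion[:WINDOW].lower()
    tokens = set(re.findall(r"[a-z0-9_]+", head))
    hit_honest = bool(tokens & HONEST_KEYWORDS)
    hit_deceptive = bool(tokens & DECEPTIVE_KEYWORDS)
    if hit_honest and not hit_deceptive:
        return "honest"
    if hit_deceptive and not hit_honest:
        return "deceptive"
    # No keyword, or conflicting keywords, inside the window: not decisive.
    return "ambiguous"
```

A completion that reaches the right answer only after the 30-character window, or that paraphrases it, falls into the "ambiguous" bin under this kind of rule, which is the recall loss the consortium labels are meant to recover.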
Important: 4-bit quantization caveat
Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.
Exact training configuration
- Base model: `google/gemma-4-E2B` (Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)
- GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)
- Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; main repo venv uses transformer_lens 2.x which pins transformers 4.x and does not recognize `model_type=gemma4`)
- Quantization config: `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16)`
- Effective weight precision: roughly 4.1 bits/parameter for the quantized weights (4-bit nf4 values plus about 0.13 bits/parameter of double-quantized block scales). Activations and residuals are fp16 throughout.
Portability theory: will this SAE work at other precisions?
Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.
Expected portability, in descending order of confidence:
- Same model, same precision, different GPU (e.g. H100): HIGH confidence of clean transfer. The SAE is a small `nn.Linear` stack; fp16 residuals from the same nf4 forward pass should be near-identical.
- Same model, 8-bit (`load_in_8bit=True`), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit; fp16 compute path unchanged. Likely some EV degradation.
- Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but are not identical. Top-feature identities likely still meaningful; d_max values may shift.
- Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
- Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
- E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
- E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.
Recommended portability test (not yet run; requires >8 GB VRAM)
- Load `google/gemma-4-E2B` in fp16 without `BitsAndBytesConfig`.
- Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 25.
- Feed those residuals through this SAE.
- Measure explained variance, L0 sparsity, alive-feature count.
- Compare to training-time metrics (documented below).
If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.
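The comparison in the last two steps could be scripted along these lines. This is a dependency-free sketch, not the upstream evaluation code: the function names are hypothetical, "EV drops >20%" is read here as a relative drop against the training-time value, and "alive-features shift >50%" is read as a relative change in the alive-feature count. Both readings are assumptions.

```python
def explained_variance(x, x_hat):
    """1 - (residual sum of squares / total sum of squares about the mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    resid = sum((a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec))
    total = sum((a - m) ** 2 for row in x for a, m in zip(row, mean))
    return 1.0 - resid / total


def mean_l0(latents):
    """Mean number of nonzero latents per sample."""
    return sum(sum(1 for v in row if v != 0) for row in latents) / len(latents)


def alive_feature_count(latents):
    """Number of latent dimensions that fire on at least one sample."""
    d = len(latents[0])
    return sum(any(row[j] != 0 for row in latents) for j in range(d))


def should_retrain(ev_train, ev_new, alive_train, alive_new,
                   ev_tol=0.20, alive_tol=0.50):
    """Apply the card's rule of thumb (assumed to mean relative changes)."""
    ev_drop = (ev_train - ev_new) / ev_train
    alive_shift = abs(alive_new - alive_train) / alive_train
    return ev_drop > ev_tol or alive_shift > alive_tol
```

In use, `x` would be the fp16 layer-25 residuals, `x_hat` the SAE reconstructions, and `latents` the SAE codes, each as a list of per-token vectors, with the training-time values from this card passed as `ev_train` and `alive_train`.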
What this means for scientific claims
Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Findings may or may not hold for other precisions. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability theorization (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).
Architecture
- architecture: TopK SAE (k=32)
- d_in: 1536
- d_sae: 6144 (expansion factor 4x)
- dtype: float32
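For readers unfamiliar with the TopK variant, here is a dependency-free sketch of one forward pass. It is not the `TopKSAE` class from the companion repository: the function name, the toy sizes, the subtraction of the decoder bias before encoding, and the ReLU on kept activations are all assumptions borrowed from common TopK SAE formulations.

```python
import random


def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations, then decode.

    x: d_in floats; w_enc: d_sae rows of d_in; w_dec: d_in rows of d_sae.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]  # assumed pre-encoder bias
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(w_enc, b_enc)]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for i in keep:
        latents[i] = max(pre[i], 0.0)  # assumed ReLU on the kept activations
    recon = [sum(w * z for w, z in zip(row, latents)) + b
             for row, b in zip(w_dec, b_dec)]
    return latents, recon


# Toy sizes with the same 4x expansion; the card's SAE is 1536 -> 6144, k=32.
random.seed(0)
D_IN, D_SAE, K = 8, 32, 4
w_enc = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_SAE)]
w_dec = [[random.gauss(0, 1) for _ in range(D_SAE)] for _ in range(D_IN)]
b_enc, b_dec = [0.0] * D_SAE, [0.0] * D_IN
x = [random.gauss(0, 1) for _ in range(D_IN)]
latents, recon = topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, K)
```

Because at most `k` latents survive per input, L0 is capped at `k` by construction rather than by a sparsity penalty.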
Training distribution
V4 input-variation (neutral + V0 threat + V3 lie) x 50 trivia, all token positions, 4-bit quantized inference
SAE quality
- explained_variance: 0.996
- L0 (mean active features per sample): 32.0
- alive features: 394
- MSE: 0.01462
- num training activations: 3619
- train time: 109.1s
Hypothesis-relevant d measurements
SAE type: v4input
Behavior split (primary claim: same-prompt V3 temperature-sampled honest vs deceptive completions):
(behavior split not applicable to this SAE type)
Input-variation split (DIAGNOSTIC β different prompts, neutral vs V0/V3 sandbag templates):
- input neutral-vs-V0 d_max: 1.4726576805114746
- input neutral-vs-V3 d_max: 2.9800453186035156
Diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. An SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.
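This card does not define d or d_max. The sketch below assumes d is Cohen's d with a pooled sample standard deviation, and that d_max is the largest absolute d over SAE features between two conditions. The function names and that definition are assumptions, so check the upstream repo before comparing numbers.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0  # constant feature in both conditions: no separation
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def d_max(acts_a, acts_b):
    """Largest |d| over features, with the index of the feature achieving it.

    acts_a / acts_b: lists of per-sample feature-activation vectors.
    """
    best_d, best_idx = 0.0, -1
    for j in range(len(acts_a[0])):
        d = abs(cohens_d([row[j] for row in acts_a], [row[j] for row in acts_b]))
        if d > best_d:
            best_d, best_idx = d, j
    return best_d, best_idx
```

Under this reading, the neutral-vs-V0 and neutral-vs-V3 figures would each come from one call, with the SAE activations of the two prompt conditions as `acts_a` and `acts_b`.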
Downstream test results (this session)
T6: V3 emergent-bridge (architecture-generalization)
Applies this SAE's features to same-prompt temperature-sampled honest-vs-deceptive behavior activations. Tests whether input-variation features transfer to behavior classification.
- Neutral vs V0 input-split d_max (matched training): 7.81
- Neutral vs V3 input-split d_max: 9.32
- Same-prompt behavior-split d through this SAE: 0.49
- Transfer ratio (behavior d / input-V0 d): 0.063 (~6.3%)
- Top-5 feature-ID overlap (behavior vs V4-V0 input): 1/5
Interpretation: V4-input SAE features that cleanly separate instructional templates do NOT transfer to separating same-prompt honest-vs-deceptive behavior. This is the 6th architecture to show this pattern in the upstream repo (after Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen3-0.6B-Base, and two Llama-3.2-1B layers). Supports the working hypothesis that residual-stream SAEs on mixed input-variation data encode input conditions, not generation processes.
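The two summary statistics in this test are simple to reproduce from the listed values. The helper names below are hypothetical; the numbers passed to `transfer_ratio` are the ones reported above.

```python
def transfer_ratio(behavior_d, input_d):
    """Behavior-split d as a fraction of the input-split d_max."""
    return behavior_d / input_d


def top_k_overlap(ids_a, ids_b):
    """Number of feature IDs shared by two top-k lists."""
    return len(set(ids_a) & set(ids_b))


# Values reported in this card: behavior d = 0.49, neutral-vs-V0 d_max = 7.81.
ratio = transfer_ratio(0.49, 7.81)
print(f"transfer ratio: {ratio:.3f}")
```

`top_k_overlap` would take the two top-5 feature-ID lists, which this card does not enumerate.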
T5: Directional steering via decoded direction
- cos(this-SAE-decoded direction, raw mean-diff) = 0.289 -> the decoded direction loses most of the raw direction's alignment; expected under polysemantic-superposition.
- Alpha sweep through this SAE's decoded direction:
| alpha | honest_rate |
|---|---|
| -2.0 | 0.702 |
| -1.0 | 0.816 |
| -0.5 | 0.640 |
| +0.0 | 0.776 |
| +0.5 | 0.750 |
| +1.0 | 0.745 |
| +2.0 | 0.574 |
Honest rate is non-monotonic across alpha on Gemma 4 E2B; the decoded direction does NOT produce clean causal steering on this architecture at 4-bit. Consistent with the GF-4 chance-level probe result below.
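The non-monotonicity claim can be checked directly from the table. The sketch below uses the reported values; the helper names are hypothetical.

```python
# (alpha, honest_rate) pairs copied from the sweep table above.
ALPHA_SWEEP = [
    (-2.0, 0.702), (-1.0, 0.816), (-0.5, 0.640), (0.0, 0.776),
    (0.5, 0.750), (1.0, 0.745), (2.0, 0.574),
]


def is_monotonic(values):
    """True if the sequence never changes direction."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


def direction_reversals(values):
    """Count how often the sequence switches between rising and falling."""
    signs = [1 if b > a else -1 for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


rates = [rate for _, rate in sorted(ALPHA_SWEEP)]
print(is_monotonic(rates), direction_reversals(rates))
```

A clean steering direction would be expected to give a sweep with no reversals; this one changes direction three times between alpha = -2.0 and alpha = +0.5.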
GF-4: Probe comparison (5-fold CV logistic)
Held-out same-prompt behavior split (n=20/class).
| representation | balanced accuracy | AUROC |
|---|---|---|
| raw | 0.475 ± 0.215 | 0.500 |
| v4_input_sae (this SAE) | 0.500 ± 0.137 | 0.463 |
| gf1_behavior_sae | 0.525 ± 0.094 | 0.525 |
| control_sae | 0.450 ± 0.187 | 0.537 |
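
To read these numbers against chance: balanced accuracy is the mean of the per-class recalls, so a probe that always predicts one class scores 0.5 on a balanced pool. A minimal sketch follows; the function name is hypothetical and the upstream probe code may compute the metric with a library instead.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance for two balanced classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Pool shaped like the card's held-out split: 20 samples per class.
y_true = [0] * 20 + [1] * 20
always_honest = [0] * 40
print(balanced_accuracy(y_true, always_honest))
```

With 20 samples per class and 5 folds, each test fold holds about 4 samples per class if the folds are stratified (an assumption), so per-fold scores move in coarse steps and large fold-to-fold spreads are expected.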
Where this SAE sits
Part of the first-ever Gemma 4 SAE set (5 repos under Solshine/). Companion SAEs trained on the same base model + layer but different training distributions are linked from the upstream reproduction guide. For the full cross-SAE matrix, per-scenario behavior pool breakdown, and the Qwen/Llama-family cross-architecture comparison, see:
- `GEMMA4_SAE_REPRODUCTION.md`: single-file loading recipe for all 5 SAEs
- `GEMMA4_LOADING_TECHNIQUE.md`: hardware + sidecar-venv setup on 4 GB VRAM
- `HARDWARE_AND_PRECISION_PORTABILITY.md`: full training-time hardware, software versions, and precision-portability theorization
Loading (sidecar venv + HF Transformers)
```python
# Requires transformers >= 5.x and bitsandbytes for the base model,
# plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
# the companion source repository at sae/models.py.
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same 4-bit nf4 configuration used when the SAE's activations were collected.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B",
    quantization_config=bnb,
    device_map={"": "cuda"},
    low_cpu_mem_usage=True,
)

# Download and load the SAE weights
sae_pt = hf_hub_download(
    repo_id="Solshine/gemma4_e2b_v4input_topk_k32_L25_bnb4bit",
    filename="gemma4_e2b_v4input_topk_k32_L25_bnb4bit.pt",
)
state = torch.load(sae_pt, map_location="cpu", weights_only=True)
# Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.
```
Citation / context
This SAE was trained as part of the green-field behavior-SAE
investigation described in:
https://github.com/SolshineCode/deception-nanochat-sae-research
(see papers/specificity_gap/ and experiments/gf1_behavior_sae/).
Trained: 2026-04-22T01:19:54.241752+00:00