gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit

Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B-it at layer 17, extracted during 4-bit quantized inference.

Classification methodology (please read before using)

The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:

The keyword classifier has high precision on its decisive labels but low recall: completions that commit via paraphrase are dropped as "ambiguous". Scenarios without single-word anchors (e.g., day_night) are particularly under-recalled.
Multi-judge consortium re-classification has been completed on the companion JSONL for this run (*_completions.jsonl) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; scheduled re-run). The consortium-labelled JSONL is committed alongside this SAE's result file (*_multijudge_batched.jsonl).
Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples, so the keyword classifier's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for downstream probe / SAE-training pool size.
Users who want a richer pool can use the consortium-labelled JSONL directly via the upstream repo's run_phase_c_consortium_behavior_sae.py re-trainer, which re-pools under consortium labels and re-forward-passes the stored completions for residual caching.
The multijudge module is reusable: run_multijudge_classification.py with --judges keyword,behavioral,gemini_batch,claude_haiku_batch and --batch-size 10 (10-sample batches for credit efficiency).
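
The keyword-labelling rule described above can be sketched roughly as follows. This is only an illustration: the keyword lists, the function name keyword_label, and the tie-breaking rule (both or neither keyword set matching yields "ambiguous") are assumptions made for the example, and the actual matcher lives in the upstream repo. Only the 30-character prefix comes from this card.

```python
# Hypothetical keyword lists for a single trivia scenario; the real lists
# are not given in this card.
HONEST_KEYWORDS = ("paris",)
DECEPTIVE_KEYWORDS = ("london", "berlin")

PREFIX_CHARS = 30  # the card states that only the first 30 characters are matched


def keyword_label(completion, honest=HONEST_KEYWORDS, deceptive=DECEPTIVE_KEYWORDS):
    """Return 'honest', 'deceptive', or 'ambiguous' for one completion."""
    prefix = completion[:PREFIX_CHARS].lower()
    hit_honest = any(k in prefix for k in honest)
    hit_deceptive = any(k in prefix for k in deceptive)
    if hit_honest and not hit_deceptive:
        return "honest"
    if hit_deceptive and not hit_honest:
        return "deceptive"
    # Paraphrases, late commitments, and mixed hits all land here, which is
    # where the low recall comes from.
    return "ambiguous"
```

A completion that only commits after the 30-character window is labelled "ambiguous" under this rule, which is the kind of sample the multi-judge consortium can recover.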

Important: 4-bit quantization caveat

Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.

Exact training configuration

Base model: google/gemma-4-E2B-it (Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)
GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)
Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; main repo venv uses transformer_lens 2.x which pins transformers 4.x and does not recognize model_type=gemma4)

Quantization config:

BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

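Effective weight precision: ~4.13 bits/parameter for the quantized linear layers (4-bit nf4 values plus roughly 0.13 bits/parameter for the double-quantized block scales, assuming the default block sizes). Activations and residuals are fp16 throughout.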

Portability theory — will this SAE work at other precisions?

Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.

Expected portability, in descending order of confidence:

Same model, same precision, different GPU (e.g. H100): HIGH confidence of clean transfer. The SAE is a small nn.Linear stack; fp16 residuals from the same nf4 forward pass should be near-identical.
Same model, 8-bit (load_in_8bit=True), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit; fp16 compute path unchanged. Likely some EV degradation.
Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but are not identical. Top-feature identities are likely still meaningful; d_max values may shift.
Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.

Recommended portability test (not yet run — requires >8 GB VRAM)

Load google/gemma-4-E2B-it in fp16 without BitsAndBytesConfig.
Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 17.
Feed those residuals through this SAE.
Measure explained variance, L0 sparsity, alive-feature count.
Compare to training-time metrics (documented below).

If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.
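
The measurement and decision steps above could be sketched as below. This is a minimal sketch under stated assumptions: the function names sae_quality and should_retrain are hypothetical, explained variance is taken as one minus residual sum of squares over total centered sum of squares, and both thresholds are read as relative changes against the training-time values (the card does not say whether the 20% and 50% figures are relative or absolute, or whether the alive-feature shift refers to count or identity).

```python
import torch


def sae_quality(x, recon, codes):
    """Quality metrics for one batch.

    x, recon: (n, d_in) residuals and SAE reconstructions.
    codes:    (n, d_sae) sparse feature activations.
    """
    x = x.float()
    recon = recon.float()
    resid_ss = (x - recon).pow(2).sum()
    total_ss = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    active = codes != 0
    return {
        "explained_variance": 1.0 - (resid_ss / total_ss).item(),
        "l0": active.sum(dim=-1).float().mean().item(),
        "alive_features": int(active.any(dim=0).sum().item()),
    }


def should_retrain(train, test, ev_drop=0.20, alive_shift=0.50):
    """Apply the retrain rule, reading both thresholds as relative changes."""
    ev_rel = (train["explained_variance"] - test["explained_variance"]) / train["explained_variance"]
    alive_rel = abs(test["alive_features"] - train["alive_features"]) / train["alive_features"]
    return ev_rel > ev_drop or alive_rel > alive_shift
```

Here train would hold the training-time numbers from the SAE quality section (for example explained_variance 0.989 and 58 alive features) and test the values measured on fp16 residuals.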

What this means for scientific claims

Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Findings may or may not hold for other precisions. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability discussion (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).

Architecture

architecture: TopK SAE (k=32)
d_in: 1536
d_sae: 6144 (expansion factor 4x)
dtype: float32
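
For orientation, a TopK SAE with these dimensions might look like the sketch below. It is not the upstream sae.models.TopKSAE: the class name TopKSAESketch, the attribute names encoder and decoder, the use of biases, and applying ReLU before the top-k selection are all assumptions, so the state dict shipped with this repo may not load into it without renaming keys.

```python
import torch
import torch.nn as nn


class TopKSAESketch(nn.Module):
    """Illustrative TopK sparse autoencoder (hypothetical layout)."""

    def __init__(self, d_in=1536, d_sae=6144, k=32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, d_sae)
        self.decoder = nn.Linear(d_sae, d_in)

    def encode(self, x):
        # Non-negative pre-activations, then keep only the k largest per row.
        pre = torch.relu(self.encoder(x))
        top = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre)
        codes.scatter_(-1, top.indices, top.values)
        return codes

    def forward(self, x):
        codes = self.encode(x)
        return self.decoder(codes), codes
```

With k=32, at most 32 of the 6144 features are nonzero for each input row, which is consistent with the L0 of 32.0 reported under SAE quality.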

Training distribution

V4 input-variation (neutral + V0 threat + V3 lie) x 50 trivia, all token positions, 4-bit quantized inference

SAE quality

explained_variance: 0.989
L0 (mean active features per sample): 32.0
alive features: 58
MSE: 0.03036
num training activations: 3469
train time: 105.4s

Hypothesis-relevant d measurements

SAE type: v4input

Behavior split (primary claim — same-prompt V3 temperature-sampled honest vs deceptive completions):

(behavior split not applicable to this SAE type)

Input-variation split (DIAGNOSTIC — different prompts, neutral vs V0/V3 sandbag templates):

input neutral-vs-V0 d_max: 5.408595085144043
input neutral-vs-V3 d_max: 7.114202499389648

Diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. An SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.
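
One plausible way to compute a d_max value like those above is sketched here. The card does not define d_max, so this assumes it is the largest absolute per-feature Cohen's d (pooled standard deviation) between SAE activations from two conditions; the function name cohens_d_max and the eps guard for dead features are also assumptions.

```python
import torch


def cohens_d_max(acts_a, acts_b, eps=1e-8):
    """Largest |Cohen's d| over SAE features.

    acts_a: (n_a, d_sae) activations for condition A (e.g. neutral prompts).
    acts_b: (n_b, d_sae) activations for condition B (e.g. V3 prompts).
    Returns (d_max, feature_index).
    """
    acts_a, acts_b = acts_a.float(), acts_b.float()
    n_a, n_b = acts_a.shape[0], acts_b.shape[0]
    mean_diff = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    pooled_var = (
        (n_a - 1) * acts_a.var(dim=0) + (n_b - 1) * acts_b.var(dim=0)
    ) / (n_a + n_b - 2)
    # Dead features have zero variance and zero mean difference, so d = 0.
    d = mean_diff / (pooled_var.sqrt() + eps)
    idx = int(d.abs().argmax().item())
    return d[idx].abs().item(), idx
```

Because this takes a maximum over thousands of features, a large d_max can arise partly from selection; comparing against a shuffled-label baseline would help calibrate it.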

Loading (sidecar venv + HF Transformers)

# Requires transformers >= 5.x and bitsandbytes for the base model,
# plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
# the companion source repository at sae/models.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it", quantization_config=bnb,
    device_map={"": "cuda"}, low_cpu_mem_usage=True)

# Download and load the SAE weights
from huggingface_hub import hf_hub_download
sae_pt = hf_hub_download(
    repo_id="Solshine/gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit",
    filename="gemma4_e2b_it_v4input_topk_k32_L17_bnb4bit.pt",
)
state = torch.load(sae_pt, map_location="cpu", weights_only=True)

# Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.

Citation / context

This SAE was trained as part of the green-field behavior-SAE investigation described in: https://github.com/SolshineCode/deception-nanochat-sae-research (see papers/specificity_gap/ and experiments/gf1_behavior_sae/).

Trained: 2026-04-22T01:19:50.799985+00:00
