gemma4_e2b_v4input_topk_k32_L17_bnb4bit

Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B at layer 17, extracted during 4-bit quantized inference.

Classification methodology (please read before using)

The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:

  • The keyword classifier has high precision on its decisive labels but low recall: completions that commit to an answer via paraphrase are dropped into the "ambiguous" bin. Scenarios without single-word anchors (e.g., day_night) are particularly under-recalled.
  • Multi-judge consortium re-classification has been completed on the companion JSONL for this run (*_completions.jsonl) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; scheduled re-run). The consortium-labelled JSONL is committed alongside this SAE's result file (*_multijudge_batched.jsonl).
  • Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples, so the keyword classifier's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for the size of the downstream probe / SAE-training pool.
  • Users who want a richer pool can use the consortium-labelled JSONL directly via the upstream repo's run_phase_c_consortium_behavior_sae.py re-trainer, which re-pools under consortium labels and re-forward-passes the stored completions for residual caching.
  • The multijudge module is reusable: run_multijudge_classification.py with --judges keyword,behavioral,gemini_batch,claude_haiku_batch and --batch-size 10 (10-sample batches for credit efficiency).
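
As an illustration of why a strict first-30-characters keyword match has high precision but low recall, here is a minimal sketch. The keyword sets and the function name keyword_label are hypothetical and are not taken from the upstream repo; only the 30-character window and the three-way honest / deceptive / ambiguous outcome come from the description above.

```python
from typing import Iterable

# Hypothetical keyword sets for a single trivia scenario (assumption:
# the real lists live in the upstream repo and may look different).
HONEST_KEYWORDS = {"paris"}
DECEPTIVE_KEYWORDS = {"london", "berlin"}


def keyword_label(
    completion: str,
    honest: Iterable[str] = HONEST_KEYWORDS,
    deceptive: Iterable[str] = DECEPTIVE_KEYWORDS,
    window: int = 30,
) -> str:
    """Label a completion from its first `window` characters only."""
    head = completion[:window].lower()
    hit_honest = any(k in head for k in honest)
    hit_deceptive = any(k in head for k in deceptive)
    if hit_honest and not hit_deceptive:
        return "honest"
    if hit_deceptive and not hit_honest:
        return "deceptive"
    # No keyword, or conflicting keywords, inside the window.
    return "ambiguous"
```

A completion such as "It is the City of Light, obviously Paris" is labelled "ambiguous" by this sketch because the anchor word falls outside the 30-character window, which is the kind of sample an LLM judge can recover.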

Important: 4-bit quantization caveat

Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.

Exact training configuration

  • Base model: google/gemma-4-E2B (Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)

  • GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)

  • Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; main repo venv uses transformer_lens 2.x which pins transformers 4.x and does not recognize model_type=gemma4)

  • Quantization config:

    BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    
  • Effective weight precision: ~4.13 bits/parameter for the quantized weights (4-bit nf4 values plus roughly 0.13 bits/parameter of overhead for the double-quantized block scales). Activations and residuals are fp16 throughout.

Portability theory: will this SAE work at other precisions?

Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.

Expected portability, in descending order of confidence:

  1. Same model, same precision, different GPU (e.g. H100): HIGH confidence of clean transfer. The SAE is a small nn.Linear stack; fp16 residuals from the same nf4 forward pass should be near-identical.
  2. Same model, 8-bit (load_in_8bit=True), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit; fp16 compute path unchanged. Likely some EV degradation.
  3. Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but not identical. Top-feature identities likely still meaningful; d_max values may shift.
  4. Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
  5. Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
  6. E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
  7. E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.

Recommended portability test (not yet run; requires >8 GB VRAM)

  1. Load google/gemma-4-E2B in fp16 without BitsAndBytesConfig.
  2. Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 17.
  3. Feed those residuals through this SAE.
  4. Measure explained variance, L0 sparsity, alive-feature count.
  5. Compare to training-time metrics (documented below).

If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.
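
Steps 4 and 5 of the test can be sketched as follows. This is a hedged illustration, not the upstream evaluation code: the function names are hypothetical, explained variance is assumed to be 1 - SSE / total variance around the per-dimension mean, and the ">20%" and ">50%" thresholds are interpreted here as relative changes against the training-time values (the card does not say whether they are relative or absolute).

```python
import numpy as np


def sae_quality(x: np.ndarray, x_hat: np.ndarray, latents: np.ndarray) -> dict:
    """x, x_hat: (n, d_in) residuals and reconstructions.
    latents: (n, d_sae) post-TopK feature activations."""
    sse = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    active = latents != 0
    return {
        "explained_variance": float(1.0 - sse / total),
        "l0": float(active.sum(axis=1).mean()),   # mean active features per sample
        "alive": int(active.any(axis=0).sum()),   # features that fire at least once
    }


def should_retrain(train: dict, test: dict,
                   ev_drop: float = 0.20, alive_shift: float = 0.50) -> bool:
    """Assumption: thresholds are relative changes vs. training-time metrics."""
    ev_rel = (train["explained_variance"] - test["explained_variance"]) / train["explained_variance"]
    alive_rel = abs(test["alive"] - train["alive"]) / train["alive"]
    return ev_rel > ev_drop or alive_rel > alive_shift
```

For the comparison, the training-time reference would be {"explained_variance": 0.993, "alive": 241} as reported under "SAE quality".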

What this means for scientific claims

Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Findings may or may not hold for other precisions. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability theorization (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).

Architecture

  • architecture: TopK SAE (k=32)
  • d_in: 1536
  • d_sae: 6144 (expansion factor 4x)
  • dtype: float32
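
To make the architecture concrete, here is a minimal NumPy sketch of a TopK SAE forward pass. It is not the sae/models.py implementation from the companion repository: the parameter names, the subtraction of the decoder bias before encoding, and the ReLU on the kept values are assumptions borrowed from common TopK SAE formulations. The toy shapes stand in for d_in=1536, d_sae=6144, k=32.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """x: (n, d_in); W_enc: (d_in, d_sae); W_dec: (d_sae, d_in).
    Returns (x_hat, latents) with at most k nonzero latents per row."""
    pre = (x - b_dec) @ W_enc + b_enc                    # encoder pre-activations
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]       # indices of the k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)  # keep top-k, clamp negatives (assumption)
    x_hat = latents @ W_dec + b_dec                      # linear decoder
    return x_hat, latents


# Toy-sized demo with random weights (hypothetical values, not the released checkpoint).
rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 64, 4
x_demo = rng.normal(size=(5, d_in))
x_hat_demo, z_demo = topk_sae_forward(
    x_demo,
    rng.normal(size=(d_in, d_sae)), np.zeros(d_sae),
    rng.normal(size=(d_sae, d_in)), np.zeros(d_in),
    k=k,
)
```

Because exactly k entries are kept per row, L0 equals k whenever all kept pre-activations are positive, which is consistent with the reported L0 of 32.0.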

Training distribution

V4 input-variation (neutral + V0 threat + V3 lie) x 50 trivia, all token positions, 4-bit quantized inference

SAE quality

  • explained_variance: 0.993
  • L0 (mean active features per sample): 32.0
  • alive features: 241
  • MSE: 0.02123
  • num training activations: 3619
  • train time: 110.0s

Hypothesis-relevant d measurements

SAE type: v4input

Behavior split (primary claim: same-prompt V3 temperature-sampled honest vs deceptive completions):

(behavior split not applicable to this SAE type)

Input-variation split (DIAGNOSTIC: different prompts, neutral vs V0/V3 sandbag templates):

  • input neutral-vs-V0 d_max: 7.808701515197754
  • input neutral-vs-V3 d_max: 9.32315731048584

The diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. An SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.
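
For readers who want to reproduce a d_max-style number on their own activations, the sketch below assumes that d is Cohen's d with a pooled standard deviation, computed per SAE feature, and that d_max is the largest absolute value over features. The card does not spell out the exact estimator, so treat this as one plausible reading; the function names are hypothetical.

```python
import numpy as np


def cohens_d_per_feature(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a: (n_a, d_sae), b: (n_b, d_sae) feature activations for two conditions."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * a.var(axis=0, ddof=1) + (nb - 1) * b.var(axis=0, ddof=1))
        / (na + nb - 2)
    )
    with np.errstate(divide="ignore", invalid="ignore"):
        d = (a.mean(axis=0) - b.mean(axis=0)) / pooled
    # Zero-variance features (e.g. dead features) are scored 0 by choice here.
    return np.nan_to_num(d, nan=0.0, posinf=0.0, neginf=0.0)


def d_max(a: np.ndarray, b: np.ndarray) -> tuple:
    """Return (largest |d|, index of that feature)."""
    d = np.abs(cohens_d_per_feature(a, b))
    j = int(d.argmax())
    return float(d[j]), j
```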

Downstream test results (this session)

T6: V3 emergent-bridge (architecture-generalization)

Applies this SAE's features to same-prompt temperature-sampled honest-vs-deceptive behavior activations. Tests whether input-variation features transfer to behavior classification.

  • Neutral vs V0 input-split d_max (matched training): 7.81
  • Neutral vs V3 input-split d_max: 9.32
  • Same-prompt behavior-split d through this SAE: 0.49
  • Transfer ratio (behavior d / input-V0 d): 0.063 (~6.3%)
  • Top-5 feature-ID overlap (behavior vs V4-V0 input): 1/5
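
The two summary statistics above are simple to recompute. The sketch below uses the rounded d values from this list; the feature IDs in the overlap example are hypothetical placeholders, since the card reports only the overlap count and not the IDs themselves.

```python
def transfer_ratio(behavior_d: float, input_d: float) -> float:
    """Behavior-split d as a fraction of the input-split d."""
    return behavior_d / input_d


def top_k_overlap(ids_a, ids_b) -> int:
    """Number of feature IDs shared by two top-k lists."""
    return len(set(ids_a) & set(ids_b))


ratio = transfer_ratio(0.49, 7.81)
print(f"{ratio:.3f}")  # 0.063

# Hypothetical top-5 feature IDs, for illustration only.
behavior_top5 = [12, 87, 301, 4410, 5999]
input_v0_top5 = [87, 15, 22, 640, 1024]
print(top_k_overlap(behavior_top5, input_v0_top5))  # 1
```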

Interpretation: V4-input SAE features that cleanly separate instructional templates do NOT transfer to separating same-prompt honest-vs-deceptive behavior. This is the 6th model/layer configuration to show this pattern in the upstream repo (after Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen3-0.6B-Base, and two Llama-3.2-1B layers). It supports the working hypothesis that residual-stream SAEs on mixed input-variation data encode input conditions, not generation processes.

T5: Directional steering via decoded direction

  • cos(this-SAE-decoded direction, raw mean-diff) = 0.289 -> the decoded direction loses most of the raw direction's alignment; expected under polysemantic-superposition.
  • Alpha sweep through this SAE's decoded direction:
    alpha   honest_rate
    -2.0    0.702
    -1.0    0.816
    -0.5    0.640
    +0.0    0.776
    +0.5    0.750
    +1.0    0.745
    +2.0    0.574

Honest rate is non-monotonic across alpha on Gemma 4 E2B; the decoded direction does NOT produce clean causal steering on this architecture at 4-bit. Consistent with the GF-4 chance-level probe result below.
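
The non-monotonicity can be checked directly from the sweep values in the table. The helper names below are illustrative only; the numbers are copied from the alpha sweep above.

```python
# (alpha, honest_rate) pairs from the alpha sweep.
sweep = [(-2.0, 0.702), (-1.0, 0.816), (-0.5, 0.640), (0.0, 0.776),
         (0.5, 0.750), (1.0, 0.745), (2.0, 0.574)]
rates = [rate for _, rate in sweep]


def is_monotonic(values) -> bool:
    """True if the sequence never increases or never decreases."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


def direction_changes(values) -> int:
    """Count how often the sequence switches between rising and falling."""
    signs = [1 if b > a else -1 for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


print(is_monotonic(rates))       # False
print(direction_changes(rates))  # 3
```

A clean steering direction would be expected to give a monotonic curve with zero direction changes; here the honest rate reverses direction three times across seven alpha values.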

GF-4: Probe comparison (5-fold CV logistic)

Held-out same-prompt behavior split (n=20/class).

representation             balanced accuracy   AUROC
raw                        0.475 ± 0.215       0.500
v4_input_sae (this SAE)    0.500 ± 0.137       0.463
gf1_behavior_sae           0.525 ± 0.094       0.525
control_sae                0.450 ± 0.187       0.537
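
A probe comparison of this kind can be sketched with scikit-learn as below. This is an assumption-laden illustration rather than the upstream GF-4 script: the standardization step, the stratified shuffled folds, and reading the "±" as the standard deviation across folds are all choices made here, and probe_cv is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def probe_cv(X: np.ndarray, y: np.ndarray, n_splits: int = 5, seed: int = 0) -> dict:
    """5-fold CV logistic probe. X: (n, d) representations, y: binary labels."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring=("balanced_accuracy", "roc_auc"))
    bal = scores["test_balanced_accuracy"]
    return {
        "balanced_accuracy_mean": float(bal.mean()),
        "balanced_accuracy_std": float(bal.std()),   # spread across folds
        "auroc_mean": float(scores["test_roc_auc"].mean()),
    }
```

With n=20 per class, each test fold holds only 4 samples per class, so per-fold balanced accuracy moves in steps of 0.125 and fold-to-fold spreads of the size reported in the table are unsurprising at chance level.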

Where this SAE sits

Part of the first-ever Gemma 4 SAE set (5 repos under Solshine/). Companion SAEs trained on the same base model + layer but different training distributions are linked from the upstream reproduction guide. For the full cross-SAE matrix, per-scenario behavior pool breakdown, and the Qwen/Llama-family cross-architecture comparison, see:

Loading (sidecar venv + HF Transformers)

    # Requires transformers >= 5.x and bitsandbytes for the base model,
    # plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
    # the companion source repository at sae/models.py.
    import torch
    from huggingface_hub import hf_hub_download
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-4-E2B",
        quantization_config=bnb,
        device_map={"": "cuda"},
        low_cpu_mem_usage=True,
    )

    # Download and load the SAE weights
    sae_pt = hf_hub_download(
        repo_id="Solshine/gemma4_e2b_v4input_topk_k32_L17_bnb4bit",
        filename="gemma4_e2b_v4input_topk_k32_L17_bnb4bit.pt",
    )
    state = torch.load(sae_pt, map_location="cpu", weights_only=True)

    # Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.

Citation / context

This SAE was trained as part of the green-field behavior-SAE investigation described in: https://github.com/SolshineCode/deception-nanochat-sae-research (see papers/specificity_gap/ and experiments/gf1_behavior_sae/).

Trained: 2026-04-22T01:19:53.071318+00:00
